RICE  UNIVERSITY 


GEORGE  R.  BROWN  SCHOOL  OF  ENGINEERING 


in  Son- Gaussian  Random  Fields 

Kina!  Report 


t ) N U  N 0001 -I  '■'»  .1  -tl-VJ 


j  Thu 

I  far  pub!.-  < 

|  <fi»Uibuticn  r 


:x 

4  * 


.} 


DEPARTMENT  OF 

ELECTRICAL  AND  COMPUTER  ENGINEERING 


HOUSTON,  TEXAS 


.93  6 


93-14879 


o  o 


09  9 


Analysis  of  Temporal  Symmetry 
in  Non- Gaussian  Random  Fields 

Final  Report 
ONR  N00014-89-J-3152 


Don  H.  Johnson 

Computer  and  Information  Technology  Institute 
Department  of  Electrical  and  Computer  Engineering 
Rice  University 
Houston,  TX  77251-1392 

June  22, 1993 


|  Thi»  docoavni  has  b*t»n  approved 
i  tor  publr  ralccse  and  rale;  its 
|  diiliib  iiivo  ir  un!iraits«i 


kGTSDO 


Accoon  i(»  ’J  | 

NT. 3 

rpA.3i  \J 

/•  ■'  ’ 

*  t 

*  •  ■  i  „ 

J.4.* \'iil 

!-W 

By 

■  *  •  .  *" 

0;.: 

■  4 

,  t 

D*il 

U.  • 

/M 

t 

i 

V  •  I--  -  4 


1 


N0001 4-89-J-31 52 


Final  Report 


1.  Project  Goals 

The  goal  of  the  proposed  work  was  to  determine  if  the  temporal  asymmetry  of  signals 
could  be  exploited  by  signal  processing  algorithms.  We  specifically  intended  to  specify 
the  kinds  of  dependence  structures  having  physical  basis  (rather  than  those  chosen  for  the 
modeler’s  convenience)  and  to  develop  detection  or  estimation  algorithms  sensitive  to  these 
structures  that  yielded  signal  processing  gains.  Over  the  grant’s  three-year  duration,  lasting 
from  1  September  1989  until  30  September  1992,  a  total  of  $148,248  in  ONR  funds  and 
$7,700  in  Rice  University  matching  funds  were  expended.  Becuase  of  this  support,  these 
research  goals  were  accomplished,  students  receiving  ONR  support  graduated  and  have 
engineering  positiuns,  and  a  significant  volume  of  technical  literature  appeared  in  reviewed 
journals  and  conference  proceedings. 

2.  Research  Results 

Fundamentals  of  temporal  symmetry  were  outlined  in  a  masters  thesis  [8].  There  and  in 
previous  conference  papers  (1,  10],  the  fundamentals  of  temporal  symmetry  analysis  tech¬ 
niques  for  time  series  were  developed.  We  uncovered  for  the  signal  r"-'->cessing  community 
an  important  result  published  by  another  researcher  over  ten  years  earlier:  The  only  linear, 
temporally  symmetric  random  process  was  the  Gaussian.  This  result  means  that  all  linear, 
non-Gaussian  processes  were  temporally  asymmetric,  a  property  theretofore  unexplored  by 
the  signal  processing  community.  Linear  Markov  processes  comprised  the  focal  point  of  our 
work,  and  they  are  generated  by  passing  white  noise  Wn  through  a  first-order  digital  filter. 

Xn  =  aXn,  x  +  Wn 

To  illustrate  temporal  asymmetry,  we  focused  on  the  hyperbolic  secant  process,  a  partic¬ 
ular  linear  non-Gaussian  process  unmentioned  in  the  literature.  Superficially,  this  example 
greatly  resembles  a  Gaussian  one,  but  has  very  different  properties.  Another  example, 
due  to  Rosenblatt,  proved  quite  insightful.  Here,  the  linear,  first-order,  process  has  a  uni¬ 
form  amplitude  distribution.  Through  these  examples,  the  following  properties  were  proven 
valid  [11]: 

•  The  forward  conditional  expected  value  £[AW  |  A*_j]  will  be  linear  for  all  first-order 
linear  Markov  processes.  The  backward  conditional  expected  value  £[An_j  |  A'*], 
however,  will  be  linear  only  in  the  Gaussian  case.  Thus,  process  linearity  can  be 
tested  by  examination  of  the  forward  conditional  mean.  Furthermore,  a  senstftue  test 
for  Gaussianity  is  to  compare  these  conditional  expected  values  for  linearity. 

•  The  backward  mean-square  prediction  error  of  a  non-Gaussian  linear  Markov  process 
is  always  less  than  the  forward  prediction  error.  The  Rosenblatt  example  is  particu¬ 
larly  striking  in  this  regard:  The  backward  mean-square  prediction  error  is  scro  while 
the  forward  prediction  error  is  noniero.  We  have  further  shown  that  the  time-reversed 
system,  which  takes  X%  and  produces  A'*_j,  is  deterministic,  nonlinear,  and  chaotic. 
Thus,  ooe  set  of  ordered  number*  can  both  be  produced  by  a  stochastic-driven  system 
and  a  deterministic,  iterated  one.  From  another  perspective,  a  signal  viewed  looking 
forward  in  time  is  random,  while  viewed  looking  backward  in  time  is  chaotic.  We  art 
now  pursuing  the  research  question  of  what  truly  distinguishes  stochastic  from  chaotic 
signals. 
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•  We  found  that  unless  the  process  has  a  class  L  distribution,  it  must  be  generated  by 
a  so-called  random  coefficient  system.  The  hyperbolic  secant  density  is  a  member  of 
this  class,  and  therefore  has  some  physical  basis.  We  demonstrated  a  specific  test  for 
class  L  membership. 

•  We  demonstrated  a  specific  algorithm  for  generating  a  linear  process  having  a  specified 
correlation  coefficient  and  amplitude  distribution.  As  one  theoretical  application  of 
this  algorithm,  we  settled,  in  the  affirmative,  the  technical  question  as  to  whether  a 
linear  process  could  have  a  multimodal  amplitude  distribution. 

Because  of  the  importance  of  temporal  symmetry,  accurate  measurement  of  the  condi¬ 
tional  mean  becomes  an  essential  aspect  of  non-Gaussian  signal  processing.  We  developed 
a  novel  nonparametric  technique  of  efficiently  estimating  the  conditional  mean  [5,  6,  7]. 
Here,  we  used  the  theory  of  nonparametric  regression,  and  showed  that  it  could  be  applied 
to  both  stochastic  and  chaotic  systems  analysis.  We  developed  a  technique  of  identifying 
the  input-output  relation  of  the  system  that  generates  a  set  of  observations  by  operating 
on  a  white  noise  input.  Because  the  technique  is  nonparametric,  it  makes  few  assumptions 
about  the  generation  system;  the  algorithm  does  need  to  have  the  system’s  order.  These 
results  have  been  submitted  for  publication. 

In  another  line  of  work,  we  investigated  a  technique  for  determining  the  order  of  a 
Markov  linear  process  that  did  not  depend  on  the  ubiquitous  Gaussian  assumption.  Our 
algorithm  is  based  on  the  conditional  entropy  of  the  process  and  has  been  published  [4]. 
This  algorithm  applies  to  nonlinear  as  well  as  linear  Markov  processes.  Its  sole  drawback 
is  computational  complexity. 

Toward  the  end  of  the  granting  period,  a  new,  potentially  important  result  emerged 
that  is  based  on  the  notion  of  temporal  symmetry  (2,  3].  We  showed  that  all  physically 
obtained  time  series  must  result  from  time-irreversible  random  processes.  Consequently, 
models  that  produce  time- reversible  processes,  in  particular  the  Gaussian,  have  no  physical 
basis.  Stationary  Gaussian  processes  cannot  serve  as  models  of  physical  measurements. 
This  important  result  is  being  prepared  for  formal  publication. 

We  developed  a  specific  algorithm  for  designing  optimal  detectors  for  linear,  non- 
Gaussian,  continuous-time  processes  [9].  Here,  specification  of  the  random  process  is  only 
obtained  with  difficulty.  Calculation  of  the  detector  requires  detailed  analysis  of  the  Pois¬ 
son  random  measures  that  underly  the  observations.  These  results  have  been  submitted  for 
publication. 

3.  Students  Supported 

Over  the  project’s  three-year  period,  three  graduate  students  were  supported  by  ONR 
funds.  An  undergraduate  worked  on  aspects  of  the  project,  but  was  not  supported  by 
research  funds. 

Anand  R.  Kumar  received  support  for  hit  work  on  model-order  estimation.  Graduated 
with  a  Ph.D.  in  1990  and  is  now  working  for  Motorola  in  India. 

P.  Srinivasa  Rao  received  support  for  his  work  in  temporal  symmetry  and  in  robust 
detection.  He  was  awarded  his  doctoral  degree  in  1992  and  is  now  working  at  the 
IBM  Watson  Research  Center  in  New  York. 
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Y.  Kang  Lee  received  support  for  his  work  in  nonpar ametric  system  identification.  He 
received  his  masters  degree  in  1992  and  is  now  pursuing  his  doctoral  degree,  studying 
the  fundamental  properties  of  chaotic  and  non-Gaussian-stochastic  signals. 

David  D.  Becker  developed  a  numerical  algorithm  for  calculating  the  amplitude  distri¬ 
bution  of  Wn  that  could  produce  a  first-order  Markov  linear  process  having  a  specified 
amplitude  distribution.  This  work  served  as  the  topic  of  his  Senior  Honors  Project, 
and  was  his  first  research  experience.  He  wem  on  to  receive  a  masters  degree  from 
Stanford  in  1991  and  now  works  for  General  Electric  Medical  Systems. 

4.  Infrastructure 

To  complement  ONR’s  award,  Rice  University  provided  funds  to  purchase  a  SUN  (Sparc  1) 
workstation  for  use  on  the  project.  This  computer  was  used  throughout  the  granting  period 
and  is  still  used  today  in  non-Gaussian  signal  processing  research. 
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ABSTRACT 

The  dependence  structure  and  the  amplitude  distribution  of 
stationary  random  sequences  are  linked,  with  specification 
of  one  placing  constraints  on  the  other.  Time-reversible 
processes  can  be  Gaussian  or  non- Gaussian,  but  all  Gaus¬ 
sian  processes  must  be  time  reversible.  We  examine  the 
thermodynamics  of  measurement,  showing  that  while  “in¬ 
formation”  can  be  extracted  from  a  system  without  altering 
system  entropy,  meat  measurement  techniques  irreversibly 
alter  thermodynamic  state  with  a  consequent  entropy  In¬ 
crease.  Because  of  the  second  law  of  thermodynamics,  such 
entropy  changes  cannot  be  undone  and  measurements  re¬ 
flecting  thermodynamic  state  cannot  be  time  reversible. 
We  conclude  that  physical  measurements  are  not  time  re¬ 
versible,  implying  that  only  non-time-reversible  processes 
model  physically  relevant  signals.  Consequently,  Gaussian 
processes  would  seem  to  be  imprecise  representations  of 
physical  measurements. 

I.  INTRODUCTION 

The  Gaussian  process  is  unquestionably  the  most  prevalent 
model  of  both  signals  and  noise  in  communication,  control, 
and  signal  processing  theory.  For  many  reasons,  this  pro¬ 
ems  yields  analytically  tractable  results  for  a  wide  variety 
of  application  problems.  Several  important  signal  process¬ 
ing  tools  based  on  the  Gaussian  model  are  the  matched 
filter,  the  Kalman  filter,  and  the  Wiener  filter.  Despite 
its  theoretical  importance,  the  fundamental  equations  of 
physics  impose  few  constraints  on  the  amplitude  distribu¬ 
tion  of  noise  and,  to  the  authors'  knowledge,  the  stationary 
Gaussian  stochastic  process  does  not  emerge  as  the  solu¬ 
tion  of  any  physical  problem.  The  Central  Limit  Theorem 
(CLT)  stands  out  as  a  possible  exception  to  this  supposi¬ 
tion.  However,  the  convergence  of  independent  superim¬ 
posed  processes  to  the  Gaussian  is  asymptotic;  an  infinite 
number  of  processes  does  not  exist  physically  and  the  CLT 
cannot  be  used  to  justify  the  Gaussian  model  on  physical 
grounds. 

Recently,  researchers  have  realised  the  prevalence  of 
demonstrably  non-Gaussian  noise  in  physical  measure¬ 
ments  [6.7]  and  signal  processing  research  has  increasingly 
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turned  to  developing  signal  processing  algorithms  that  ap¬ 
ply  to  nou-Gaussian  noise  problems.1  While  th*  equa¬ 
tions  governing  physical  phenomena  do  not  directly  con¬ 
strain  the  probabilistic  amplitude  distribution  of  physical 
variables,  they  do  constrain  the  statistical  spatio-temporal 
dependence  properties  (correlation,  for  example)  of  signals 
we  might  measure.  We  usually  interpret  temporal  depen¬ 
dence  through  the  power  density  spectrum;  in  this  view, 
only  a  constant  power  density  spectrum  would  be  free  of 
temporal  dependence.  For  example,  temporal  correlations 
are  induced  on  propagating  ocean  acoustic  noise  by  the 
filtering  characteristics  of  the  medium  [13].  A  somewhat 
different  dependence  property  of  stochastic  signals  is  the 
notion  of  tempers/  symmetry,  where  time-reversed  sample 
functions  are  tested  for  membership  in  the  original  pro¬ 
cess.  As  we  shall  see,  a  process’s  temporal  symmetry  canno’ 
be  judged  from  its  spectrum.  When  viewed  from  the  per¬ 
spective  of  familiar  Gaussian- based  random  process  prop¬ 
erties,  this  sample-function  property  may  seem  subtle  since 
power  tpectrum  measurement*  cannot  determine  tempo¬ 
ral  symmetry.  However,  the  process's  temporal  symme¬ 
try  restricts  what  amplitude  distributions  the  process  may 
have.  Because  physical  laws  tend  to  place  constraints  on 
admissible  signal’s  temporal  properties,  we  use  these  to  pre¬ 
dict  what  amplitude  distributions  physically  possible  sig¬ 
nal s  may  have. 

II.  TEMPORAL  SYMMETRY 

A  stationary  process  {A\,  t  €  T),  is  temporally  symmet¬ 
ric  if  for  every  f j .....  ,  for  all  rs,  the  random  vectors 
{A,,,...,XU}  and  [Xt,_,lt...,Xw_u},Vf«  have  the  same 
joint  probability  distributions  (4,14).  Thus,  for  a  tempo¬ 
rally  symmetric  process,  time- re  versed  sad  delayed  sample 
func.  'ns  are  also  sample  functions  of  the  original  process. 
With  this  definition,  temporal  symmetry  is  a  stationary 
process  property  having  no  gradations:  a  process  is  tem¬ 
porally  symmetric  or  it's  not.  Fbr  example,  consider  a 
rero-tnean.  Gaussm  process;  for  all  such  processes,  the 
covariance  function  completely  characterises  the  joint  am¬ 
plitude  distribution.  As  a  stationary  process's  covariance 
function  depends  only  on  the  mapatade  of  the  difference 
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between  the  two  cample  times — JT(,]  =  /(|ft  -  <j|) — 
be  the  process  temporally  symmetric  or  not,  the  covariance 
function  is  a  temporally  symmetric  quantity.  Consequently, 
all  stationary  Gaussian  processes  are  temporally  symmet¬ 
ric.  Weiss  [14]  showed  that  the  only  discrete-time  linear 
process— stationary  time  series  generated  by  passing  white 
noise  through  a  linear,  time-invariant  system — that  could 
be  temporally  symmetric  is  the  Gaussian.  Consequently, 
all  non- Gaussian  linear  processes  must  be  temporally  asym¬ 
metric.  Nonlinear  processes  may  or  may  not  be  temporally 
symmetric  [4j.  No  basic  result  akin  to  Weiss’s  for  catego¬ 
rizing  nonlinear  processes  is  known. 

Because  the  covariance  function  is  by  definition  a  tem¬ 
porally  symmetric  quantity,  a  process’s  temporal  symmetry 
cannot  be  examined  using  second-order  statistics,  such  as 
the  power  spectrum.  To  illustrate  this  point  another  way, 
we  can  manipulate  the  temporal  symmetry  (and  the  am¬ 
plitude  distribution  as  well)  of  a  noa-Gaussian  process  by 
linear  operations  that  have  no  effect  on  the  power  spec¬ 
trum.  Pass  a  non-Gaussian  process  through  an  all-pass  fil¬ 
ter,  which  by  definition  only  affects  its  input’s  phase.  This 
phase  change  mod'fies  the  dependence  structure  of  the  in¬ 
put,  resulting  in  an  output  having  a  different  dependence 
structure  and  a  different  amplitude  distribution.  The  power 
spectra  of  the  filter's  input  and  output  arc  identical  since 
the  power  spectrum  is  insensitive  to  phase  distortions.  The 
Gaussian  process’s  insensitivity  to  phase  may  seem  “un- 
physical”,  a  notion  we  are  about  to  argue  for.  Quantities 
sensitive  to  temporal  symmetry  are  the  bispectrum  [8)  and 
the  conditional  mean  [4,11] 

Other  aspects  of  a  signal's  dependence  structure  are  af¬ 
fected  by  the  amplitude  distribution.  Consider  all  first- 
order  autoregressive  processes  parameterized  by  'he  pole 
location  c.  Given  a  Gaussian  amplitude  distribution,  any 
value  of  a  (consistent  with  stability  criteria)  is  possible 
all  first-order  dependence  structures  are  compatible  with 
the  Gaussian  amplitude  distribution.  The  Gaussian  is  not 
unique  in  this  regard;  for  discrete- time  signals,  all  ampli¬ 
tude  distributions  in  class  L  are  compatible  with  ail  first- 
order  dependence  structures  [11].  Among  these  are  the 
stable  distributions,  the  Laplacian,  and  the  hyperbolic  se¬ 
cant  [10],  For  other  amplitude  distributions,  not  all  value 
of  a  are  compatible.  Perhape  the  most  striking  is  the  uni¬ 
form  amplitude  distribution;  first-order  AR  processes  exist 
that  have  a  uniform  amplitude  distribution,  but  the  param¬ 
eter  a  can  only  equal  ±1/2, ±1/3,. ...  For  this  and  other 
m  o -class  L  distributions,  the  amplitude  distribution's  form 
has  a  direct  impact  on  its  dependence  structure 

For  our  purposes,  are  stress  the  close  coupling  between 
a  process's  temporal  dependence  structure  and  its  ampli¬ 
tude  distribution.  If  we  can  show  that  a  process  cannot  be 
temporally  symmetric,  we  must  conclude  that  it  cannot  be 
Gaussian.  This  constraint  allows  us  to  explore  the  ampli¬ 
tude  distribution  of  processes  governed  by  physical  laws  by 
considering  temporal  dependence  structures. 
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Figure  1:  The  measurement  system  extracts  information  from 
the  physical  system.  We  consider  the  two  together  as  thermody¬ 
namically  closed,  not  interacting  with  other  systems. 

III.  THERMODYNAMICS  AND 
MEASUREMENTS 

Virtually  all  signal  processing  algorithms  are  applied  to 
measurements  taken  from  some  physical  system  (figure  1). 
We  presume  that  the  measurement  process  is  not  intended 
to  modify  the  physical  system.  Instead,  the  intent  is  to  cap¬ 
ture  some  time-persistent  aspect  of  the  system.  We  thus 
expect  to  mod  *1  the  measurement’s  “random"  components 
as  a  stationary  process.  Non-statistical  components  are  of¬ 
ten  present  too.  For  example,  the  physical  system  could  be 
a  communication  channel  where  the  signal  represents  dig¬ 
ital  data  and  the  random  component  is  additive  channel 
noise.  The  signal  reception  process  should  not,  in  engineer¬ 
ing  jargon,  “load  down”  the  transmission  system,  continu¬ 
ally  changing  its  characteristics.  Under  these  assumptions 
on  the  measurement  process,  we  can  justify  using  station¬ 
ary  stochastic  models  to  describe  the  noise,  enabling  us  to 
derive  appropriate  signal  processing  procedures. 

The  effects  of  measurement  on  a  physical  system  can  be 
quantified  by  considering  thermodynamics.  The  key  con¬ 
cept  is  thermodynamic  entropy.  Loosely  speaking,  a  sys¬ 
tem’s  entropy  S  is  defined  to  be  kb P,  where  k  is  Boltz¬ 
mann’s  constant  and  P  is  the  number  of  accessible  micro¬ 
scopic  states.  The  Second  Law  of  thermodynamics  states 
thi.t  a  closed  system’s  thermodynamic  entropy  can  never 
decrease  and  that  entropy  increases  are  proportional  to  the 
work  extracted  from  the  system. 

AS>  0  and  <W=*TAS 

Modern  studies  in  the  luermodynamics  of  computation 
have  clarified  this  classic,  but  ill-defined,  concept  of  entropy. 
One  particularly  illuminating  definition  due  to  Zurek  [15] 
expresses  thermodynamic  entropy  as  the  sum  of  two  terms. 
The  first  is  the  Shannon  entropy  H  -  T.P  log  p  of  the 
probability  distribution  of  the  system's  state;  the  second  is 
the  sfjentAnue  entropy  K  defined  as  the  length  of  shortest 
possible  description  for  what  is  known  about  the  system. 3 
The  algorithmic  entropy  might  be  defined  as  the  logarithm 
of  the  shortest  possible  Turing  machine  program  needed  to 
describe  what  is  known  about  state.  Thus,  the  first  term  ex¬ 
presses  what  any  device  or  person  does  not  “know’  about  a 
system’s  state — the  uncertainty — and  the  second  expresses 

'We  ksve  as  UUapI  is  r  tail  ik«  units  of  eatropy  i|n>  aznoeg  lbs 
Ttrvxu  eouopy  it&ntbem*.  Tka  pap**  u  more  losctrttf  with  concepts 
ttu  details.  Is  lbs  cad,  each  must  b*  reuiUpfced  by  Boilssuas's 
coosUal  sad  tbs  lopuiltras  ba*»  s  comma  aslaral  bast 
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what  is  known. 


S  =  H  +  K 

Note  that  this  definition  is  “human- free:"  a  person  need 
not  be  present  to  make  the  measurement;  devices  such  as 
A/D  converters  are  encompassed  by  these  definitions. 

This  definition  for  thermodynamic  entropy  clarifies  dis¬ 
cussions  held  over  more  than  one  hundred  years  about 
Maxwell’s  demon  [1,2,3,12).  In  1871,  Maxwell  described  a 
“demon”  that,  by  measuring  nolecule  positions,  could  en¬ 
able  a  machine  to  do  work  v: .  uout  increasing  entropy.  Only 
by  recognizing  the  role  of  measurement  has  the  demon  been 
reconciled  with  the  Second  Law  of  thermodynamics.  Orig¬ 
inally,  Szilard  in  1929  [12]  and  Brillouin  in  1962  [3]  argued 
that  measurement  of  any  physical  variable  must  be  accom¬ 
panied  by  changes  in  the  information  content  (proportional 
to  the  Shannon  information)  of  the  variable.  This  measure¬ 
ment  translates  uncertainty  in  system  state  into  certainty 
and  thereby  increases  the  algorithmic  entropy.  Using  mod¬ 
em  terminology,  they  incorrectly  argued  that  balance  be¬ 
tween  state  uncertainty  and  knowledge  could  not  be  main¬ 
tained  (A H  >  -A K),  and  they  concluded  that  work  must 
be  performed  in  the  measurement  process  with  a  concomi¬ 
tant  increase  in  thermodynamic  entropy.  This  work  would 
exactly  balance  the  work  seemingly  provided  by  Maxwell’s 
demon  and  thus  uphold  the  Second  Law.  However,  these  re¬ 
searchers  did  not  explore  whether  a  more  efficient  technique 
existed  for  performing  the  required  measurement.  Based 
on  his  work  on  reversible  computation,  Bennett  in  1982  (1) 
showed  that  balance  between  Aff  and  AA'  could  b'  main¬ 
tained  theoretically  and  that  measurement  did  not  neces¬ 
sarily  increase  total  entropy.  Bennett  noted  that  the  demon 
must  return  to  its  original  state  to  initiate  another  cycle 
of  measurement  and  work.  To  return  to  the  original  state 
means  discarding  the  just  completed  measurement;  destroy¬ 
ing  the  certainty  gained  by  measurement  takes  work  and 
this  work  balances  the  decreased  algorithmic  entropy  [15). 
Thus,  a  detailed  analysis  of  information  transfer  explains 
why  Maxwell’s  demon  does  satisfy  the  Second  Law. 

In  most  physical  cases,  performing  a  measurement  does 
take  work,  meaning  that  the  measurement  system  consumes 
power  and  that  the  thermodynamic  entropy  of  the  com¬ 
bined  physical  and  measurement  systems  increases.  Ideally, 
if  sufficient  care  were  taken  in  the  measurement  process 
and  the  detailed  information  gleaned  was  never  destroyed, 
overall  system  entropy  would  be  constant.  Since  such  ideal 
circumstances  rarely  exist,  measurement  tn  most,  i /not  ell, 
physical  systems  u  net  thermodynamically  reversible:  once 
the  measurement  process  has  increased  thermodynamic  en¬ 
tropy.  the  measurement  cannot  be  undone  precisely  (algo¬ 
rithmic  entropy  precisely  traded  for  uncertainty).  Accord¬ 
ing  to  this  view,  a  sequence  of  measurements,  which  we 
express  by  a  scalar-valued  time  series  A'(<),  are  most  of¬ 
ten  obtained  by  increasing  thermodynamic  entropy  These 
increases  do  not  necessarily  mean  that  the  entropy  of  the 
physical  system  being  measured  has  increased.  Theoreti¬ 
cally,  •  prion  uncertainty  can  be  exchanged  for  measure¬ 
ment  certainly  without  increasing  entropy.  Consequently, 


measurement  does  necessarily  not  “loao  down”  the  phys¬ 
ical  system  and  the  resulting  measurements  can  be  well- 
modeled  by  a  stationary  process. 

Be  that  as  it  may,  the  measurement  process  cannot  be 
reversed  for  two  very  different  reasons.  First  of  all,  the 
work  expended  in  making  the  measurement  cannot  be  re¬ 
turned  by  undoing  the  measurement  process:  entropy  has 
increased  and  cannot  be  decreased  to  provide  the  neces¬ 
sary  work.  .Secondly,  and  most  importantly,  to  the  degree 
that  measurements  are  directly  associated  with  certainty 
about  system  state,  the  time-reversed  sequence  of  mea¬ 
surements  cannot  be  equivalent  to  a  measurement  sequence 
from  the  physical  system.  Such  temporally  reversed  mea¬ 
surements  would  seemingly  represent  undoing  the  measure¬ 
ments,  thereby  corresponding  to  increasing  system  uncer¬ 
tainty,  while  maintaining  constant  knowledge  about  state 
(after  all,  the  measurements  are  in  hand).  Siynals  thus  ob¬ 
tained  by  physical  measurements  cannot  be  temporally  sym¬ 
metric.  This  critical  fact  obviates  any  stationary  stochastic 
process  model  for  signals  or  noise  that  is  temporally  sym¬ 
metric. 

For  these  physical  reasons,  Gaussian  random  process 
descriptions  of  measured  signals  would  seem  to  be  an  ab¬ 
straction  without  a  physical  basis.  To  recap,  all  station¬ 
ary  Gaussian  stochastic  processes  are  temporally  reversible; 
processes  modeling  measurements  cannot  be  because  of  the 
Second  Law.  Thus,  non-Gaussian  processes  provide  the 
only  viable  model  for  physical  measurements.  However, 
temporally  symmetric  non-Gaussian  processes  are  also  in¬ 
appropriate;  because  of  Weiss's  theorem,  such  processes 
must  arise  from  nonlinear  models.  Temporally  symmet¬ 
ric  non-Gaussian  processes  that  describe  physical  measure¬ 
ments  can  be  produced  by  both  linear  and  nonlinear  mod¬ 
els  (11).  Our  interpretation  of  thermodynamics  has  not 
produced  further  restrictions  on  possible  random  process 
models  for  measurements. 

rv.  DISCUSSION 

Because  of  the  Second  Law  of  thermodynamics,  measure¬ 
ments  convey  the  conversion  from  information- theoretic  un¬ 
certainty  to  algorithmic  (measurement)  certainty.  Tempo¬ 
rally  reversing  the  time  series  could  not  represent  the  same 
measurement  process  as  the  reversed  lime  series  would  sug¬ 
gest  a  physically  impossible  situation:  the  con! meal  en¬ 
tropy  transformation  from  its  algorithmic  to  its  uncertain 
form  without  a  net  entropy  increase.  Based  on  these  physi¬ 
cal  restrictions,  the  moat  accurate  stochastic  process  models 
for  data  axe  temporally  asymmetric  ones. 

The  use  of  Gaussian  processes  in  signal  processing  would 
thus  appear  to  rest  on  wreak  ground,  justifying  considera¬ 
tion  of  alternate.  non-Gaussian  signal  processing  strategies. 
The  structure  and  properties  of  non -Gaussian  stochastic 
processes  need  to  be  understood  before  physically  relevant 
suhsets  of  this  clast  can  be  selected.  Once  the  process  class 
that  accurately  models  measurement  has  been  defined,  the 
signal  processor  would  naturally  seek  signal  processing  al¬ 
gorithms  that  could  ex  pleat  the  structure  imposed  by  the 
measurement  and  best  glean  the  information  contained  a 
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the  data.  Unfortunately,  optimal  noa-Gtmaian  signal  pro¬ 
cessing  operations  are  not  equivalent  to  the  Gaussian  ones. 
Detection  theory  provides  one  example  The  matched  fil¬ 
ter  applies  only  to  Gaussian  noise  problems;  optimal  detec¬ 
tion  for  non-Gaussian  noise  demands  alternative  structures. 
Furthermore,  the  analytic  simplicity  provided  by  the  Gaus¬ 
sian  process  rarely  carries  over  to  non-Gaussian  problems. 
Few  statistical  signal  processing  algorithms  have  been  de¬ 
veloped  that  are  tailored  to  the  amplitude  distribution  as 
well  as  to  the  temporal  dependence  structure. 

We  couid  apply  Gaussian-based  algorithms  to  non- 
Gaussian  problems.  Taking  another  example  from  detec¬ 
tion  theory,  one  could  use  a  matched  filter  (linear)  detector 
for  a  non-Gaussian  noise  problem.  However,  this  filter’s 
unit-sample  response  is  not  proportional  to  the  signal  as 
it  is  for  Gaussian  situations  [9],  Furthermore,  the  per¬ 
formance  for  the  optimal  linear  detector  can  greatly  sur¬ 
pass  that  designed  for  the  Gaussian  problem.  How  much 
the  optimal  linear  detector  degrades  system  performance 
when  compared  to  the  optimal  one  is  not  known.  We  need 
to  specify  how  to  vary  Gaussian-based  strategies  for  non- 
Gaussian  problems  and  to  quantify  the  losses  incurred  when 
Gaussian-based  systems  are  used  in  physical  situations  in¬ 
stead  of  those  keyed  to  physically  accurate  non-Gaussian 
models. 
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ABSTRACT 

In  this  paper,  we  apply  the  non  parametric  kernel  predictor 
to  tke  time-series  prediction  problem.  Because  nonpar  arse  t- 
ric  prediction  makes  few  assumptions  snout  tke  underlying 
time  series,  it  is  useful  wken  modeling  uncertainties  are  per¬ 
vasive,  suck  as  wken  tke  time  series  is  naa-Ganasiaa.  We 
akow  tkat  tke  nonpar ametric  kernel  predictor  is  asymptot¬ 
ically  optimal  (or  bounded,  miring  time  series.  Numerical 
experiment*  are  also  performed:  For  tke  nonlinear  autore¬ 
gressive  process,  tke  kernel  predictor  is  tkown  to  greatly 
outperform  tke  linear  predictor;  for  tke  Henon  time  series, 
tke  estimated  predictor  closely  resembles  tke  Henon  map. 


1.  INTRODUCTION 

Tune-series  prediction  is  a  problem  frequently  encountered 
in  many  branches  of  science  and  engineering.  In  this  pro’  - 
lent,  we  would  like  to  predict  future  values  of  a  time  series 
based  on  its  present  and  previous  observations;  consider,  for 
example,  linear  predation  In  practice,  this  prediction  pro¬ 
cess  consists  of  two  steps:  estimation  of  tke  predictor  {rota 
all  available  observations,  followed  by  prediction  of  s  future 
time-series  value  by  evaluating  tke  estimated  predictor  us¬ 
ing  present  observations.  Cl  anneal]  y,  tke  estimation  of  tke 
predictor  kaa  been  simplified  by  tke  assumption  of  *  pare- 
rue  Inc  model  of  tke  lime  aeries,  to  that  tk;  optimum  predic¬ 
tor  can  also  be  described  parametrically.  As  a  result  of  this 
riaplificalioa,  the  predictor  estimation  process  is  greatly 
reduced  to  tke  task  of  estimating  only  a  finite  cumber  of 
parameters.  In  tke  familiar  cane  of  linear  prediction,  we 
assume  tke  signal,  e.g.,  speech,  to  be  linear  au to- regressive 
(AR).  Tke  corresponding  predictor  is  then  formed  bj  esti¬ 
mating,  often  very  efficiently,  eg,  via  Levinson's  algorithm, 
tke  coefficient*  of  the  linear  model. 

Sonparometne  prediction  provides  an  alternative  to  the 
classical  method*  of  linear  and  higher -order  prediction  when 
parametric  spedkcnlioas  br  tke  time  aeries  art  other  *a- 
available  or  dsbtons.  This  approach  suly  assumes  that  the 
model  describing  the  time  senes  is  smooth  Whereas  a  para¬ 
metric  ft  is  global  and  spas*  all  of  stale  space.  *  nonpara- 
metric  entiaatoe  Sts  data  locally  by  taking  advantage  of 
the  smoothness  condition  Tke  dans  of  posubie  relation¬ 
ships  in  noepnrnmetnc  o-malio*  is  equivalet  l  to  the  dam 
of  smooth  functional,  wh  is  dearly  too  large  to  be  pa 
rameterisnd.  la  a  sense,  *e  may  contrast  parametric  sad 
nomparametric  estimation  by  the  aaaampUUis  they  make: 
Tke  par* metric  method  requires  gwartfifatiw  tpeohcaUon* 

*Weefc  mpportnd  ia  pert  by  the  ONR  undw  gram  NOCOIS- 
sp-j-mr  - 


a*  opposed  to  the  qualitative  assumption*  made  in  non- 
par  ametric  estimation.  Effectively,  nonpar  ametric  predic¬ 
tion  makes  only  modest  assumptions,  making  it  amenable 
wken  modeling  uncertainties  are  pervasive. 

2.  NONPARAMETRJC  PREDICTION 
Based  on  observations  Xi, Xj,...,Xs  of  stationary  time 
series  {X„},  we  want  to  estimate  Afrr+i  using  a  ptk  order 
predictor:  Based  os  tke  most  recent  p  observations,  nots- 
tionally  summarised  by  X*  =  [Xrr, . . . ,  Xs-r+ »],  estimate 
Xfi+i.  Tkat  is,  tke  predictor  of  Xs-ti  can  be  written  as 
Xs 41  =  f(Xjv),  where  #(-)  is  some  function  tkat  map*  BP 
to  SL  This  (unction  is  usually  unknown  and  must  be  es¬ 
timated  from  data  aa  well.  This  paper  does  not  address 
tke  order  determination  problem  and  assumes  that  predic¬ 
tor  order  p  is  known.  Tke  interested  reader  can  refer  to 
(lj  for  an  order  selection  technique  based  on  nonpar  ametric 
regression. 

If  Ike  mean  square  error  (MSE)  criterion  is  used,  tke 
optimum  p1*  order  predictor  of  Xset  is  tke  conditional 
expected  value  of  Xsex  given  Xs.--  ,Xs-ee: 

*  ^a,|X4 

Tke  coaditioaal  expectation  above  is  a  random  variable 
measurable  with  respect  to  tke  r-algebcs  of  (Xiv, .... 
Xw-yail  and  can  there  tore  be  expressed  as 

£tX*a.PU)  =  r(X,,). 

where  r()  is  a  function  that  maps  from  Rr  to  SL  We  call 
r(-)  the  condt  tonal  neon  function 

l.\.  Kernel  Predictor 

ta  practical  situations,  the  conditional  mean  fusctica  u  un¬ 
known  and  mutt  be  estimated  from  data.  Ia  this  estima¬ 
tion  process,  w*  search  for  a  mapping  tkat  best  describes 
the  caned  relationship  between  random  vector  X*  and  ran¬ 
dom  variable  (  biand  on  their  observations  (Xy.  .-Vy»; ). 
(X^si .  -V^j),  ....  {Xs-i.Xs)  (we  have  abbreviated  vec¬ 
tor  (A*.. ..  ,X,_psi]  as  X,).  W*  shall  refer  to  each  ob- 
aervatioa  vector  X«  aa  a  predictor  rector  and  each  scalar 
obnerratios  j  as  the  response  corresponding  to  predic¬ 
tor  vector  X». 

A  parametric  method  such  as  linear  prediction  assumes 
that  r(-)  is  linear,  thus  constraining  the  estimate  to  be  lin¬ 
ear.  The  sea  par  ametnc  method  does  not  constrain  the 
form  of  the  estimate  By  la  big  advantage  of  tke  smooth - 
sen*  of  e(  ),  tke  son  parametric  kerne f  repression  estimate', 
tke  noeparamelrtc  estimator  of  Ike  coadjUoasl  mean  fare 
Uon  al  a  poant  x  €  R’  consuls  of  locally  averaging  the 
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Figure  1.  Scatter  Plot  and  Kernel  Regression  Estimate 


Figure  2.  Over-smoothing  and  *  ■  dec-smoothing 


responses  corresponding  to  those  predictor  vectors 

within  a  neighborhood  of  x: 
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responses  within  a.  (infinitesimally'  ill  neighborhood  of 
the  conditioned  predictor  veclr 

Completing  the  prediction  p:  css,  we  evilait*  the  kernel 
regression  estimator  f  (-)  at  tha.  present  predictor  vector  x  = 
X/j.  We  can  then  write  **  kernel  predictor  of  A' > + 1  as 

X«4i  =  r(X.v)-  (2) 


In  the  numerator  of  (1),  kernel  K( ■),  along  with  bandvidll-. 
kx,  play*  the  role  of  a  weighting  function  »od  assigns  a 
weight  to  each  response  X„+i  baaed  on  the  distance  be¬ 
tween  predictor  vector  X,  and  x.  A'{-)  is  generally  positive 
and  decrease*  from  the  origin;  one  example  is  the  multivari¬ 
ate  Gaussian  function.  Hence,  response.!  corresponding  to 
predictor  vectors  close  to  x  axe  weighed  more  heanly  and 
have  more  effect  than  those  with  predictor  vector*  afar.  1  he 
bandwidth  parameter  Vv  haa  the  role  of  controlling  the  ex¬ 
tent  of  the  local  neighborhood  about  x.  A  large  bandwidth 
allow*  more  responses  (corresponding  to  predictor  vectors 
around  x)  to  be  averaged,  and  a  small  bandwidth  has  the 
opposite  effect  Choosing  an  appropriate  bandwidth  i*  cru¬ 
cial  Sor  a  good  estimate;  this  is  discussed  in  the  next  sec¬ 
tion.  As  a  general  rule,  the  bandwidth  will  decrease  as  A' 
increases  because  local  volumes  will  be  (tiled  more  densely 
when  more  data  become  available.  The  denominator  in  (1) 
simply  serves  to  normalise  the  weighting  ot  the  responses. 

As  x  varies  within  its  domain,  the  kernel  estimator  can 
be  viewed  as  a  moving  average  in  predictor- vector  space, 
as  opposed  to  the  uses]  ao'toa  of  a  moving  average  in 
time.  Figure  1  shows  ti?  scatter  plot  (A\,«i  versus  A'.) 
of  4V  *1  100  samples  of  the  first-order  nonlinear  autoregres¬ 
sive  ( SA  R(S ))  time  senes 


2.2.  Bandwidth  Selection 

Selecting  an  appropriate  bandwidth  is  crucial  for  good  es¬ 
timates.  If  the  bandwidth  is  too  large,  over-smoothing  oc- 
cuis.  Likewise,  if  the  bandwidth  is  100  small,  the  resulting 
estimate  is  under-smoothed  and  jagged.  See  Figure  2. 

To  be  consistent  with  the  noaparametnc  nature  of  the 
kernel  predictor,  the  selection  of  the  bandwidth  should  be 
based  on  information  inherent  in  the  data  and  not  on  a  pri¬ 
on,  possibly  inaccurate,  assumption*.  Such  a  iato-driven 
technique,  called  cross  validation,  his  gained  wide-  range 
support  among  statisticians  and  time-aeries  analysts.  De¬ 
fine  the  leave-one-out*  regression  estimator  r>j,{x)  a*  fol¬ 
lows: 


Z-L..  *  (nr) 


The  appropriately  named  leave-one-out  estimator  f>  *(*)  is 
the  kernel  regression  estimator  of  (1)  computed  without  us¬ 
ing  the  pair  (X*.  A”.*i ),  The  cross  ceisdefton  function  CV*( ) 
is  the  sample  prediction  error  using  the  se*ve-ea4-oet  esti¬ 
mator  as  the  predictor: 


\\+x  m  2««(A'.)  +  fc.  -  . .id.  A*(0. 1) 

The  true  conditional  expectation  r(s)  **  2  nn(  1 )  is  shown 
by  the  dashed  curve.  Also  shown  is  the  kerned  regression 
estimator  f  (*)  inside  the  interval  [—3  3, 3.0!;  it  ts  coar-puted 
using  a  Gasser  as  kernel  with  a  bandwidth  of  Arm  «  0.25. 
At  s  =»  -2.9,  we  show  how  the  responses  are  windowed 
to  produce  the  corresponding  estimate,  as  is die* ted  by  the 
asterisk.  The  kernel  estimator  repret  sts  a  aatural  and  in¬ 
tuitive  estimator  of  the  conditional  mean  function  because 
the  conditional  mete  is  nothing  b*l  the  local  average  of  the 


x-i 

CV(k)  m  i  £(#..*(X.)  -  A\*,  )* 

The  bandwidth  k  that  minimises  CV(-)  is  selected  and  used 
to  compute  the  regression  estimator  r(-).  This  minimisa¬ 
tion  is  performed  over  all  possible  4  values,  and  the  con¬ 
straint  stt  for  k  may  r\|a ire  some  subjectivity  ta  c.Jer  to 
reduce  tk*  amount  of  computations.  If  f(-)  were  used  in¬ 
stead  of  f,.,  )  to  compute  CV(k),  it  caa  be  easily  shows 


Smmt 


that  C V ( h )  is  minimised  at  h  =  0,  yielding  a  useless  solu¬ 
tion.  The  use  of  r,,a()  effectively  thwarts  this  singularity. 
It  is  important  to  note  that  h  is  derived  completely  from 
data  and  is  consistent  with  the  nonparametric  nature  of  our 
prediction  method. 

3.  CONSISTENCY  RESULTS 

As  we  have  already  mentioned,  it  is  difficult  to  specify  joint 
distributions  of  non-Gaussian  time  series.  To  analyze  prop¬ 
erties  of  estimators,  however,  certain  specifications  on  the 
dependence  structure  of  these  time  series  must  be  made. 
We  therefore  select  a  very  general  specification,  called  a 
mixing  condition  [2],  for  the  time  series  we  analyze. 

Theoretically,  the  two  estimators  that  we  have  so  far  dis¬ 
cussed  are  different.  The  kernel  regression  estimator  f(-) 
is  an  (pointwise)  estimator  of  the  conditional  expectation 
function  r(x),  whereas  the  kernel  predictor  ?(X//)  is  the 
estimator  of  the  predictor  r(Xs)-  We  need  to  elucidate 
the  distinction  at  this  juncture  because  consistency  of  one 
estimator  does  not  imply  consistency  of  the  other.  The  dif¬ 
ference  here  is  akin  to  the  difference  between  pointwise  and 
norm  convergence  of  a  sequence  of  functions.  Because  our 
impetus  is  the  time-series  prediction  problem,  we  should 
concentrate  on  the  analysis  of  the  kernel  predictor.  Point- 
wise  consistency  of  the  kernel  regression  estimator  for  <j>- 
mixing  time  series  (and  others)  has  been  shown  [3].  See 
also  [4]  for  consistency  results  of  the  nearest-neighbor  re¬ 
gression  estimator. 

Next,  we  need  to  determine  an  appropriate  mode  of  sta¬ 
tistical  convergence  for  the  kernel  predictor.  Almost  sure 
consistency  can  be  found  in  [3].  However,  this  does  not 
imply  that  the  kernel  predictor  asymptotically  matches  the 
performance  of  the  conditional  mean.  For  this  reason,  we 
believe  that  it  is  inappropriate  to  analyze  the  a.s.  or  in  prob¬ 
ability  convergence  of  the  kernel  predictor.  Instead,  it  would 
be  more  appropriate  to  analyze  its  L3  convergence.  Lee  [5] 
has  shown  that  for  bounded  p,  and  a-mixing  time  series, 
the  kernel  predictor  is  asymptotically  optimal  in  the  sense 
thot 

^fr(X^)  -  r(X^)]3  —  0. 

The  rate  at  which  this  convergence  occurs  depends  on  the 
rates  at  which  the  mixing  coefficients  and  bandwidth  con¬ 
verge  to  zero.  For  example,  for  an  exponentially  4 -mixing 
time  series  (i.e.,  its  mixing  coefficient  4*  is  proportional  to 
a*,  a  <  1),  the  kernel  predictor  converges  at  a  rate  of 

£(f(X*)  -  r(Xjv)]1  =  0(h5*  +  log5  N/(Nk’s))  (3) 

Thus,  consistency  occurs  if  hv  0  and  (Nh^)/  log5  N 
oo.  Using  (3),  the  convergence  rate  can  be  optimized  by 
taking  the  bandwidth  to  be 

Ks,^oc(N/\og3  ••  (4) 

Convergence  is  faster  fur  time  series  that  have  weaker  de¬ 
pendence  structures  (fast  converging  mixing  coefficients).  If 
the  dependence  is  too  strong  (mixing  coefficients  converging 
too  slowly  to  zero),  the  kernel  predictor  is  not  insured  to 
converge.  The  most  rapid  convergence  occurs  in  the  trivial 
situation  when  time  series  samples  are  completely  indepen¬ 
dent. 

It  is  important  to  note  that  the  consistency  result  above 
.specifies  only  the  bandwidth  rater  that  are  admissible.  For 
example,  if  h/v  satisfies  (4),  so  does  Hhs,  for  some  non-zero 
constant  if.  -Clearly,  the  two  sequences  will  yield  different 


Figure  3.  Snapshot  of  NAR(2)  Time  Series 

errors,  but  the  rate  at  which  the  errors  converge  will  be  the 
same.  Thus,  bandwidth  values  that  work  in  theory  may 
not  be  obtainable  in  practice.  Cross  validation  can  provide 
somewhat  of  a  bridge  between  theory  and  practice.  In  the 
ii.d.  setting  (the  usual  setting  for  regression  analysis),  cross 
validation  is  asymptotically  optimal  for  the  kernel  estima¬ 
tor,  but  the  rate  is  slow  [6].  Unfortunately,  little  is  known 
in  the  time-series  setting  about  cross  validation  for  either 
the  kernel  estimator  or  the  kernel  predictor.  This  remains 
an  open  area  for  research. 

4.  EXAMPLES 
Nonlinear  Autoregressive  Process 

Consider  the  following  second-order  nonlinear  autoregres¬ 
sive  (NAR(2))  time  series. 


X„+J  =  0.9 


1  +  4(Xn-l  —  X„) 

l  +  tX.-r-Xn)5 


+  Wn+X 


with  Wn  ~  i.i.d.  N( 0,1).  This  time  series  is  a  gener¬ 
alization  of  the  well  known  linear  autoregressive  process. 
Unfortunately,  the  nonlinearity  makes  XH  very  difficult  to 
examine  analytically.  For  example,  we  do  not  know  its 
(scalar)  amplitude  distribution,  let  alone  its  joint  distri¬ 
butions.  In  fact,  we  do  not  even  know  if  it  is  station¬ 
ary!  Because  (1  +  4(xj  —  xo))/(l  +  (*j  —  *o)5)  is  bounded 
for  all  (xo.xj)  6  R5,  Xn  has  finite  moments.  The  scalar 
value  0.9  is  used  to  normalise  its  variance  to  approximately 
3.0.  Its  observed  mean  value  is  zero.  See  Fig.  3  for  a 
snapshot  of  200  sample  values.  Even  though  the  station- 
arity  of  X*  is  questionable,  its  conditional  mean  is  in¬ 
variant  to  time  index  n.  Because  Wn  is  independent  to 
Am,  m  <  n,  the  conditional  mean  function  is  simply 
r(*o,x»)  =  0.9(1  -f  4(*i  -  xo))/(l  4-  (si  -  x«)5)  for  all  n. 

We  test  the  kernel  predictor  at  sample  sixes  of  N  —  200, 
500, 1000,  2000,  and  3000.  For  each  sample  size,  cross  vali¬ 
dation  is  performed  to  select  the  appropriate  bandwidth. 
The  kernel  regression  estimate  at  N  =  500  is  shown  in 
Fig.  4;  the  linear  estimate  is  shown  for  comparison  sake. 
A  Gaussian  kernel  with  a  bandwidth  of  k  =  0.40  is  used. 
The  kernel  predictor  is  then  tested  against  the  nest  1000 
samples.  It  is  evident  that  the  linear  predictor  cannot 
adequately  capture  the  nonlinear  relationship  and,  conse¬ 
quently,  performs  poorly  when  compared  to  the  kernel  pre¬ 
dictor.  At  N  =  500,  the  kernel  predictor  has  a  MSE  of 
about  1.1,  compared  with  1.9  for  the  linear  predictor  and 
1.0  for  the  optimnm  predictor,  or  about  95%  of  optimal 
(because  X*  has  a  variance  of  3.0). 

Chaotic  Time  Series 

The  kernel  predictor  can  be  applied  to  lime  series  produced 
by  deterministic  difference  equation.  It  is  likely  to  perform 
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Figure  4.  Performance  Analysis  for  NAR(2)  Tinw  Series 

well  with  these  time  aeries  because  no  noise  it  present.  A 
special  case  of  such  a  time  series  is  one  that  is  chaotic. 
Consider  the  Henon  time  aeries  (Fig.  5): 

X*+i  =  1-1.4**  +0.3X._, 

with  initial  conditions  Y_i  =  Xo  =  0.0. 


ji - - - - - - - 1 
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Figure  S.  Snapshot  of  Henon  Time  Series 


The  same  prediction  strategy  as  before  is  followed  here.  A 
scatter  plot  of  Xn,  the  trne  Henon  map,  and  regression  esti¬ 
mate  at  N  =  S00  (viewed  from  two  perspectives)  art  shown 
in  Fig.  6.  Prediction  errors  are  negligible,  from  N  =  50 
on,  and  therefore  not  shown.  Because  the  kernel  estima¬ 
tor  performs  local  averaging  of  data,  estimates  are  made 
only  at  those  locations  where  data  aggregates.  Although 
little  inference  can  be  made  for  the  map  at  all  locations, 
the  resulting  estimate  is  effective  for  prediction  purposes. 

5.  CONCLUSIONS 

Nonpar aroetric  time-series  prediction  provides  as  alterna¬ 
tive  to  the  classical  methods  of  linear  and  higher -order  pre¬ 
diction.  Because  it  makes  only  modest,  qualitative  assump¬ 
tions,  son  parametric  prtdictios  may  be  applicable  even 
when  little  is  known  about  the  time  series  nnder  study. 
Such  situations  arise  when  the  time  series  is  neither  lin¬ 
ear  nor  Ganasiea.  In  light  of  present-day  emphasis  on  aon- 
Gaussiac  signal  processing,  it  would  seem  fitting  to  imcor- 
porate  nos  parametric  prediction  into  the  analyst's  toolbox. 

Open  questions  for  further  investigation  remain.  The 
most  notable  is  a  way  to  mitigate  the  effect  of  the  “curse 


Figure  6.  Henon  Time  Series:  Scatter  Plot,  Henon  Map,  and 
Regression  Estimates 

of  dimensionality,*  so  called  because  nonpar vmetric  meth¬ 
ods  require  a  large  amount  of  data  for  dimensions  higher 
than  about  three  (p  >  3).  This  shortcoming  needs  to  be 
overcome  before  problems  like  target  tracking  and  speech 
modeling  can  reap  the  benelits  provided  by  nonparametric 
prediction. 

In  the  case  of  chaotic  time  series  that  are  produced 
by  nonlinear  iterative  equations,  nonparametric  prediction 
performs  well  because  no  randomness  is  present  in  the  re¬ 
sponses.  Because  little  averaging  is  needed,  dimensionality 
effects  are  not  as  severe  as  for  stochastic  time  series.  Non¬ 
parametric  prediction  seems  to  have  much  potential  in  this 
area  [l,  7,  S]. 
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Atersrt-CarrtUted  non-Gaussian  Markov  sequences  can  be 
considered  as  filtered  white  noise  (independent,  identically  dis¬ 
tributed  sequences  of  random  variables),  the  filter  being  a  non¬ 
linear  system  in  genetal.  We  discuss  the  applicability  of  linear 
models  and  nonlinear  methods  based  on  the  diagonal  series  ex¬ 
pansion  of  bivariate  densities  for  analyzing  this  system.  Non- 
Gaussian  sequences  exhibit  different  properties  in  the  forward 
and  backward  directions  of  time.  We  explore  the  connection  to 
system  modeling  of  this  temporal  asymmetry  and  some  of  Its 
consequences.  As  an  example,  we  analyze  a  first-order  linear 
autoregressive  model  with  hyperbolic  secant  amplitude  distri¬ 
bution  at  its  output. 


I.  Introduction 

THE  signals  and  noise  encountered  in  the  signal  pro¬ 
cessing  environment  (e.g.,  ocean  acoustic  noise  [19]) 
are  often  not  Gaussian.  Be  that  as  it  may,  many  signal 
processing  algorithms  are  based  on  the  assumption  that 
the  signal,  or  noise,  or  both  are  Gaussian.  Even  when  the 
appropriate  r  or. -Gaussian  amplitude  distribution  is  used, 
the  samples  are  assumed  to  be  independent  or  at  least  un- 
correlated.  The  performance  of  algorithms  which  ignore 
the  non-Gaussian  nature  of  the  input  and/or  the  depen¬ 
dence  structure  is  seriously  limited  when  the  algorithms 
are  inappropriately  applied.  A  common  way  of  account¬ 
ing  for  the  dependency  of  non-Gaussian  dau  is  to  model 
the  process  as  a  pointwise  transformation  of  a  correlated 
Gaussian  process.  Although  this  method  facilitates  easy 
generation  of  dependent  processes,  it  yields  an  analyti¬ 
cally  complex  dependency  structure  which  is  insufficient 
to  describe  the  possible  range  of  dependencies  (18).  De¬ 
velopment  of  new  algorithms  which  take  into  account  the 
non-Gaussianity  and  correlation  structure  requires  an  in- 
depth  study  of  the  properties  of  non-Gaussian  time  series 
and  how  they  can  be  modeled  and  generated. 

The  correlation  function  is  inadequate  in  capturing  the 
dependency  structure  of  a  non-Gaussian  time  series;  only 
the  multivariate  Gaussian  density  depends  solely  on  the 
covariance  matrix.  Another  reason  for  this  inadequacy  is 
the  intriguing  fact  that  non-Gaussian  processes  often  cx- 
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hibit  different  properties  in  the  forward  and  backward  di¬ 
rections  of  time  quite  unlike  Gaussian  processes.  In  the 
sequel,  we  shall  refer  to  such  processes  as  being  tempo¬ 
rally  asymmetric. 1  The  inherent  symmetry  in  the  defini¬ 
tion  of  correlation  function  makes  it  insensitive  to  tem¬ 
poral  asymmetry  and  reduces  its  ability  to  capture  the 
dependence  structure  of  a  non-Gaussian  nrocess.  What 
aspects  of  non-Gaussian  sequences  are  then  important  in 
specifying  their  properties?  How  can  they  be  exploited  in 
signal  processing  algorithms?  A  key  issue  in  developing 
analysis  tools  that  can  be  used  to  ansv  er  tli^e  questions 
is  how  to  model  the  generation  of  non-Gaussian  signals. 
That  is,  given  a  specification  of  a  non-Gaussian  signal, 
how  can  it  be  produced  by  a  possibly  nonlinear  system 
operating  on  an  elementary  random  sequence?  Gaussian 
signals  can  be  generated  by  passing  independent,  identi¬ 
cally  distributed  (i.i.d.)  Gaussian  time  series  through  the 
appropriate  linear  system.  For  non-Gaussian  signals,  are 
nonlinear  systems  necessary?  If  so  when? 

Before  attempting  to  answer  these  questions,  we  make 
two  simplifying  assumptions.  First,  we  assume  that  the 
signals  are  (strict-sense)  stationary.  Second,  we  assume 
that  the  signal:  are  Markovian:  the  generating  systems  of 
such  signals  are  characterized  by  a  small  number  of 
"states."  Making  use  of  the  relatively  new  theoretical 
notion  of  temporal  symmetry,  we  will  discuss  the  suit¬ 
ability  of  linear  models  for  non-Gaussian  processes  and 
then  propose  a  technique  for  developing  nonlinear  models 
for  a  class  of  these  processes. 

II.  Temporal  Symmetrv 

All  stationary  Gaussian  processes  are  symmetric  with 
respect  to  the  (discrete)  time  axis:  a  time-reversed  sample 
function  XL,  of  a  Gaussian  process  is  also  a  sample  func¬ 
tion  of  the  same  process  and  is  thus  statistically  indistin¬ 
guishable  from  it.  Non-Gaussian  processes  do  not  neces¬ 
sarily  exhibit  this  symmetry.  For  example,  sunspot 
number  data  collected  since  the  year  1750  have  been  noted 
to  fail  more  rapidly  than  they  rise  [3].  Neural  discharge 
patterns  have  also  been  found  to  be  asymmetric  with  re¬ 
spect  to  time  (14). 

Definition:  A  stationary  process  {X^,  n  =  0,  ±  !,•••} 
is  temporally  symmetric  if  the  random  vectors  {X.,.  X.j. 

’TV  tem  timt  rrrernbiiity  hu  bee*  it  tV  li«»*suis.  Sc*  foe  ix- 
Mipie  {52J.  (17). 
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•  •  •  7*,}  and  X_w,  •  •  •  7_^}  have  the  same  joint 
distribution  for  ail  k  and  n,,  1  £  i  ■&  k  [32], 

Temporal  symmetry  of  Gaussian  processes  follows 
from  the  fact  that  the  joint  distribution  of  the  amplitudes 
of  a  stationary,  zero-mean  Gaussian  process  is  completely 
specified  by  the  covariance  function,  which  is  by  defini- 
ion  symmetric  for  all  processes,  Gaussian  or  not: 

=  E[X.^mi]  follows  from 
stationarity.  Weiss  [32]  showed  that  all  autoregressive 
moving  average  (ARMA)  non-Gaussian  processes  (i.e., 
those  that  satisfy  a  linear  difference  equation  of  the  form 
X„  =  Efij  OiX,,-!  +  LjmQbjWH_j,  where  {W„}  isani.i.d. 
sequence),  with  the  exception  of  purely  MA  processes  (a, 
=  0)  with  even  or  odd  symmetric  coefficient  sequences, 
are  temporally  asymmetric.  Hence,  the  only  temporally 
symmetric  linear  process  is  the  Gaussian.  Nonlinear  non- 
Gaussian  models  on  the  other  hand,  may  or  may  not  be 
temporally  symmetric.  The  importance  of  the  concept  of 
temporal  symmetry  stems  from  this  close  association  with 
the  linearity /nonlinearity  of  non-Gaussian  models. 

Clearly,  to  analyze  ftilly  the  temporal  symmetry  of  a 
stationary  process,  we  must  deal  with  joint  distributions 
of  the  amplitudes  of  the  process  at  arbitrary  times,  not 
simply  second-order  statistics.  Since  we  are  dealing  with 
Markov  processes,  it  is  sufficient  to  consider  bivariate  dis¬ 
tributions  (more  on  this  in  the  next  section).  The  bivariate 
distributions  of  a  temporally  symmetric  process  are  sym¬ 
metric  functions: 

Pz j&(*.  y )  =  Px.k.x-,(x,  y) 

=  /W*.  y)  *  P)u,x,(y.  x) 

ring  temporal  symmetry,  stationarity,  and  a  simple  reor¬ 
dering,  respectively.  On  the  other  hand,  the  bivariate  dis¬ 
tributions  of  'imporally  asymmetric  processes  are  not 
svmtr  Mnc  functions.  This  fac'  may  appear  to  be  counter¬ 
intuitive  since  the  ma-  rinal  amplitude  distributions  at  any 
two  time  inst-nts  must  be  iden'iud  due  to  stationarity: 

(  Px,.x,(*.  z)  13  f  />&.«&  y )  dy  «  px{z) 

These  equations  must  hold  whethc:  the  joint  amplitude 
distribution  is  symmetric  r  not  (i.e.,  the  process  is  tem¬ 
porally  symmetric  or  not).  Several  examples  of  asymmet¬ 
ric  joint  distributions  with  equal  marginals  are  given 
throughout  this  paper.  A  cominuous-timt  example  of  an 
asymmetric  Markov  process  (constructed  using  asymmet¬ 
ric  bhariatc  densities)  is  given  in  [32]. 

The  measurement  of  the  joint  d>st..bution  function  is 
highly  data  intensive,  and  hence  the  ouestior.  arises  as  to 
which  quantities  are  maximally  sensitive  to  the  statistical 
properties  of  an  observed  sequent,  pa-ricula.ly  its  tem¬ 
poral  symmetry,  and  how  these  quantities  can  be  used  in 
system  identifier  tion  procedure- .  The  chosen  statistics 
must  not  be  fundamentally  symmetric  quantities,  like  the 
correlation  function,  in  order  to  capture  any  temporal 


asymmetry  of  the  time  series.  Joint  asymmetric  averages 
(e.g.,  ffXiX*])  have  been  suggested  for  this  purpose  [6]. 
Higher  order  spectra  have  been  shown  to  be  suited  for 
parametric  linear  system  (ARMA)  characterization  [24]. 

The  statistics  having  possibly  more  utility  are  the  con¬ 
ditional  expectations  £[X„|X,-i]  and  £[X,_ ,  |X„]  as  they 
can  be  used  even  when  the  generating  system  is  nonlinear. 
These  quantities,  which  we  shall  refer  to  as  the  “for¬ 
ward”  and  “backward"  conditional  means,  are  not  in¬ 
herently  symmetric  with  respect  to  their  arguments  and  do 
provide  information  about  the  dependency  structure  of  the 
sequence:  if  these  two  statistics  are  different,  the  se¬ 
quence  cannot  be  temporally  symmetric. 

Consider,  for  example,  the  joint  distribution 

Px.,x.. ,(*.  y)  =  W*  -  jy  +  ?)  +  'ft*  -  jy  -  z)» 

Clearly,  this  joint  density  is  asymmetric  while  its  margin¬ 
als  are  uniform  over  [— $,  |].  Its  conditional  means  are 
calculated  as  the  means  of  its  two  conditional  probability 
densities  p*.  |  x, _ ,  (x  |  y)  and  px, . ,  |  x.  ( y  I  *) .  As  the  marginal 
is  uniform,  the  conditional  densities  have  the  same  func¬ 
tional  form  as  the  joint  distribution.  Respectively,  the  for¬ 
ward  and  backward  conditional  means  are  found  to  be 

£[*,_,  |XJ  =  27,  modi  -l 

The  forward  mean  is  affine  while  the  backward  mean  is 
discontinuous;  this  difference  clearly  demonstrates  the 
temporal  asymmetry  of  the  associated  process  if  it  exists.’ 

The  other  possibility,  namely  the  identicalness  of  the 
conditional  means  is,  however,  insufficient  to  prove  the 
temporal  symmetry  of  the  process.  After  developing  more 
insight  into  the  structure  of  non-Gaussian  Markov  pro¬ 
cesses,  we  shall  return  to  the  properties  and  applications 
of  conditional  means. 

IH.  Non-Gaussian  Markov  Processes 

A  random  process  (X,}  is  Mth-order  Markov  if  its  con¬ 
ditional  probability  densities  have  the  property 

Px.|x.-. .*.->•  ‘  •  ‘  >3.  ‘  ’  •) 

sPx.ix.-t.  ••• . *-«(*!*.  •••  .>*)  (1) 

for  all  n.  If  {7„}  is  also  stat  onary,  these  conditional  den¬ 
sities  do  not  depend  on  n.  The  process  {X,}  is  said  to  be 
completely  specified  if  ail  joint  densities  of  amplitude*  at 
differe.it  instants  are  known.  It  follows  easily  that  station¬ 
ary  Markov  processes  are  completely  specified  by  the 
conditional  density  furttion  given  above.  If  the  process 
is  first-order  Markov,  -  a  conditional  densities  can  be  ob¬ 
tained  from  the  "trsnv.ional  density"  Px.|x.-i('  i  *)  ty 

’A  process  hivinj  these  chstsetenstics  u  easily  defined  sod  trill  be  dis¬ 
cussed  its  the  aesl  section 


x, 


using  the  Chapman-Kolmogorov  equation  [7,  p.  89] 

Px.1*.— C*|y) 

=  (px.|*.-.(*|z)Px.-.|*,-(2|y)^.  (2) 


Thus  the  transitional  density  or  equivalently  the  bivariate 
density  Px..x.-,(x>  y)  completely  specifies  the  fiist-order 
Markov  process. 

The  definition  of  Markov  process  given  in  (1)  is  one 
sided  and  gives  the  impression  that  -  Markov  process  has 
an  inherent  direction  of  time,  namely,  past  amplitude  val¬ 
ues  specifying  the  statistical  properties  of  the  present.  One 
might  conclude  that  a  time-reversed  Markov  process  is  no 
longer  Markov.  However,  an  equivalent  definition  can  be 
given  in  terms  of  the  conditional  independence  of  the  past 
and  future,  given  the  present.  From  this  symmetric  defi¬ 
nition,  it  follows  that  if  {X,}  is  Markov,  so  is  {X_„}  [7, 
p.  83];  this  result  can  also  be  obtained  directly  from  the 
the  one-sided  definition  above  [22,  p.  386].  However,  the 
two  Markov  processes,  the  original  and  its  time  reversed 
version,  may  have  different  characteristics  and  hence  be 
temporally  asymmetric.  In  the  case  of  a  first-order  Mar¬ 
kov  process  where  y)  is  a  symmetric  function 

of  x  and  y,  it  follows  from  Chapman-Kolmogorov  equa¬ 
tion  that  Px,.x.-m(x>  y)  is  also  symmetric  for  all  m  2:  2 
and  as  a  result,  it  is  easily  seen  that  the  process  is  tem¬ 
porally  symmetric.  Hence  a  first-order  Markov  process  is 
temporally  symmetric  if  and  only  if  Px..x.-,(*>  y)  is  sym¬ 
metric. 

If  the  Markov  process  {X„}  is  Gaussian,  it  can  be  gen¬ 
erated  by  an  all-pole,  Afth-order  linear  system  described 
by 

Xt  =  +  a2X._,  +  •  •  •  +  auX,.M  +  W,  (3) 

where  (H',}  is  an  i.i.d.  Gaussian  sequence.  Processes 
generated  in  this  fashion  are  often  referred  to  as  being 
autoregressive  (AR).  Autoregressive  models  are  quite 
commonly  used  in  diverse  areas  such  as  geophysics  and 
speech  processing  [20]. 

Non-Gaussian  Markov  sequences  may  or  may  not  be 
generated  by  linear  autoregressive  systems,  but  they  can 
be  considered  as  a  generalization  of  AR  sequences,  which 
are  known  to  have  a  simple  statistical  structure.  To  obtain 
the  generation  model  of  non-Gaussian  Markov  sequences, 
we  must  begin  with  the  conditional  density  function.  Sup¬ 
pose  {I/,}  is  the  output  obtained  by  passing  a  Afth-order 
Markov  process  {X.}  through  a  system  having  the  input- 
output  relationship  given  by  the  conditional  distribution 
function  (cumulative) 

**  p •  •  •  ,  X,-*).  (4) 

{(/„}  is  then  i.i.d.  and  uniformly  distributed  over  (0,  1} 
(22,  p.  181],  Thus,  this  system  is  the  equivalent  of’ ‘whit¬ 
ening"  filter  for  Gaussian  time  series  and  yields  the  in¬ 
novations  sequence  {(/,}  corresponding  to  {X.}.  Typi¬ 
cally,  this  system  is  nonlinear  with  a  finite  number  M  of 
states.  Being  a  distribution  function,  the  above  input-out- 


ffl)  Un 


Fig.  1.  Generation  of  a  Markov  sequence  of  order  M. 


put  relationship  is  monotonic  and  hence  can  be  inverted 
to  give  the  generating  system  of  a  general  Afth-order  Mar¬ 
kov  time  series: 

=  f£|x.-t.--  JC.-jt(£/,«l •**-!.  •••  .*,-*)•  (5) 

This  generation  model  is  shown  in  Fig.  1.  In  the  Gaussian 
case,  this  generating  system  takes  the  form  of  a  memo¬ 
ryless  nonlinearity,  which  transforms  the  i.i.d.  uniform 
sequence  to  an  i.i.d.  Gaussian  sequence,  followed  by  an 
all-pole  linear  system.  In  the  general  case,  the  memory¬ 
less  nonlinearity  is  usually  present,  but  is  followed  in  gen¬ 
eral  by  a  nonlinear  system  having  memory.  In  either  case, 
specification  of  the  conditional  distribution  function  leads 
to  the  system  that  generates  the  process. 

A.  Non-Gaussian  Autoregressive  Processes 

Very  often,  inversion  of  the  conditional  distribution 
function  is  extremely  difficult  in  practice.  Sometimes  the 
conditional  distribution  function  itself  may  not  have  a 
closed  form.  To  proceed  further,  we  first  explore  linear 
models:  from  the  above  description,  we  assume  that  the 
memoryless  nonlinearity  transforms  the  i.i.d.  uniform  se¬ 
quence  into  some  intermediate  non-Gaussian  (i.i.d.)  se¬ 
quence  which  is  then  passed  through  a  linear  filter.  We 
are  thus  led  to  AR(Af)  models  for  non-Gaussian  se¬ 
quences.  Validity  of  a  linear  model  can  be  verified  in 
practice  and  we  illustrate  this  here.  From  now  on,  we  will 
focus  attention  on  first-order  Markov  processes  (Af  =  1). 

A  stationary  AR(1)  process  {X,}  is  defined  by 

X.  =  pX._,  +  W,  n  =0,  ±1,  •••  (6) 

where  { H', }  is  a  zero-mean  sequence  of  i.i.d.  random  vari¬ 
ables  and  |  p  |  <  1 .  The  system  constant  p  is  also  the  nor¬ 
malized  correlation  coefficient  of  the  output  process  {X,} . 
One  main  issue  of  concern  at  this  point  is  what  first-order 
Markov  non-Gaussian  sequences  can  be  characterized  this 
way  (i.e. ,  are  linear  processes)?  We  next  discuss  this  is¬ 
sue  via  i)  the  characteristics  of  the  forward  and  backward 
conditional  means  (and  their  relevance  to  the  direction¬ 
ality  of  the  process)  and  ii)  the  variety  of  amplitude  dis¬ 
tributions  for  {X,} .  We  will  have  more  to  say  about  the 
conditional  means  and  their  use  in  selecting  a  linear  ver¬ 
sus  nonlinear  model  for  non-Gaussian  data  in  sequel. 

1)  Conditional  Means  and  Directionality:  In  the  case 
of  linear  AR(1)  systems,  the  forward  conditional  mean  is 
a  linear  function,  the  slope  of  which  is  the  system  coef¬ 
ficient: 

E[X,|X._I]  =  pX._,. 

The  backward  conditional  mean  depends  heavily  on  the 
amplitude  distribution  of  the  input  If  the  input  is 

Gaussian,  the  backward  conditional  mean  is  same  as  the 
forward:  both  are  linear.  As  an  example  of  the  non-Gauss- 
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ian  case,  consider  the  AR(1)  model  with 
[—1/2, 1/2]  and  p  ■  l/*forJfc  *  ±2,  ±3 
process  has  the  conditional  means 


X„  uniform 
•  *  •  /  This 


ElK-M 


f(**J  mod  l-j, 
likXn  +  j)  mod  1  -  l 


Jkeven 
k  odd. 


The  dissimilarity  of  these  quantities  reveals  the  temporal 
asymmetry  of  the  time  series.  Although  a  general  expres¬ 
sion  for  the  backward  conditional  mean  of  non-Gaussian 
AR(1)  processes  cannct  be  obtained,  Lawrence  [17] 
showed  that  it  is  always  nonlinear.  As  this  result  is  useful 
for  us  later,  we  give  the  proof  here. 

Theorem  1:  The  backward  conditional  mean 
E[Xt-  ||X,7  of  an  AR(1)  process  is  linear  only  in  the 
Gaussian  case  [17]. 

Proof:  Making  use  of  stationarity  and  the  indepen¬ 
dence  of  X-i  and  WM,  we  find  that 


*ir(«) 


*x(“) 

4>x(pu) 


(7) 


the  forward  mean-squared  prediction  error  equals  the 
mean-squared  value  of  W„  [13].4  We  now  show  that  this 
inequality  of  forward  and  backward  prediction  errors  gen¬ 
eralizes  to  all  AR(1)  processes. 

Theorem  2:  The  backward  mean-squared  prediction 
error  of  a  non-Gaussian  AR(1)  sequence  is  always  less 
than  the  forward  mean-squared  prediction  error. 

Proof:  The  forward  conditional  mean  of  AR(1) 
models  is  linear,  hence  the  forward  mean-squared  predic¬ 
tion  error  is 

£[(*.  -  E[X„ |X,_,])2]  =  £[«,  -  pX„_|  -  Em)1]. 

(9) 

Since  the  conditional  expectation  is  the  best  mean-square 
estimator,  we  have 

£[(*.- 1  -  £[*._, | XJ)2]  £  £[(*.-,  -  tfX,))2]  (10) 

for  all  g(')  with  equality  only  when  g(X„)  =  £[X,_t  |XJ. 
Making  use  of  Theorem  1 ,  we  find  that 

£[(*„_,  -  £K,-,|XJ)2] 

<  £[(*,_,  -p*,-£[Hy)2]  (11) 


where  4>x(«)  =  £[e-^]  is  the  characteristic  function  of 
the  random  variable  X.  From  (6)  and  (7),  the  joint  char¬ 
acteristic  function  of  X,  and  X„- 1  is 

v)  =  £(«p  {juX,  +jvX'.{}] 
m  $x(pu  +  v)bx(u) /^xipu).  (8) 

Differentiating  with  respect  to  v  and  setting  »*0.we 
find  that 

;£[*._  ,e**|  =  *Kp«)*xO<)/*x(A«). 

Using  the  properties  of  conditional  expectation,  the  left- 
hand  side  could  be  rewritten  as ;£[e>a’£'(X,_ , |X,]].  If 
the  backward  conditional  mean  is  affine,  we  must  then 
have 

aVx (u)  +  ;M>x(u)  *  4>x(pu)4>x(a) /$x(p«). 

Dividing  by  ♦*(")  leads  to  a  functional  equation  requiring 
that  (")  he  affine  in  u,  which  implies  a  Gaussian 

marginal  distribution.  □ 

One  of  the  interesting  consequences  of  temporal  asym¬ 
metry  in  autoregressive  models  is  that  forward  and  back¬ 
ward  prediction  errors  need  not  be  equal.  Equality  of  these 
errors  is  implicit  in  signal  processing  algorithms  such  as 
Burg’s  maximum  entropy  method  (12,  p.  22).  In  the  case 
of  our  uniform  AR(1)  example,  it  follows  from  (7)  that 
the  input  takes  the  values  — (} it }  -  l)/2|k|,  -  (jk| 
-  3)/2|i|,  •  •  •  (|  A ]  —  1)/2|A|  with  equal  probability. 
It  can  then  be  shown  that  Xt.[  is  completely  determined 
by  X,:  X._,  *  ( kX, )  mod  1  -  1  /2  for  k  even  and  t 
*  {kX,  +  I  /2)  mod  1  -  1  /2  for  k  odd.  Thus  the  process 
has  zero  prediction  error  in  the  backward  direction,  while 


for  non-Gaussian  {X,}.  It  is  easily  verified  that  the  right- 
hand  sides  of  (9)  and  (1 1)  are  equal  and  hence 

£[(*-1  -  £[*.-> I X.1)2]  <  E[{Xtt  -  E[X. I*..,])2]. 

(12) 

□ 

2)  Amplitude  Distribution:  It  is  difficult  to  find  the  dis¬ 
tribution  of  the  output  of  the  linear  system  (6)  when  the 
input  {IF,}  is  non-Gaussian.  A  tractable  approach  for 
AR(1)  models  is  to  assume  a  known  amplitude  distribu¬ 
tion  for  the  output  {>[,}  and  then  derive  that  of  the  input 
{#'„}.  Given  the  characteristic  function  of  X,  the  ratio  of 
(7)  can  then  be  used  to  find  the  characteristic  function  of 
W  if  the  ratio  represents  a  valid  characteristic  function 
(i.e.,  the  ratio  must  be  a  positive  definite  function).  Com¬ 
plete  characterization  of  the  distributions  having  this 
property  and  thus  produced  by  linear  AR(1)  systems  is  not 
known  [11].  However,  under  the  restriction  that  the  model 
be  defined  for  all  values  of  the  system  coefficient  p  be¬ 
tween  0  and  1 ,  these  distributions  are  identical  to  the  class 
L  (or  self-decomposable)  distributions  well  known  in  the 
probability  literature  [9],  [10].  These  distributions  are  a 
subclass  of  the  infinitely  divisible  distributions  containing 
all  the  stable  distributions  and  have  been  shown  to  be  uni- 
modal  [34] .  In  the  important  case  of  symmetric  output 
distributions,  it  follows  from  (7)  that  membership  in  class 
L  guarantees  that  the  model  is  defined  for  the  entire  range 
-1  <  p  <  1.  Using  the  Livy  characterization  of  infi¬ 
nitely  divisible  distributions,  we  represent  the  character¬ 
istic  function  of  a  symmetric,  non-Gaussian  random 


*Thu  curiom  eutopic  vu  ptAept  8ru  pointed  out  bj  RoseabUtt  [24, 
‘TV  earlier  eurapie  u  the  end  of  Section  fl  co-rjerp->atJj  to  i  «  2.  p.  52). 
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variable  belonging  to  class  L  as  [29] 


*x(«)  ■  log  *x(“) 


(eM  -  1)  —  da  (13) 


where  Ro  =  R  \  0  and  k(a)  is  an  odd  symmetric  function 
which  is  nonnegative,  nonincreasing  on  (0,  oo)  and  sat¬ 
isfies  the  condition  f|a|*iock(a)  da  +  Ji^xor'/kte)  da 
<  ».  Taking  the  derivative  on  both  sides  of  (13) 


.  d*x(u) 

*  du 


e^kia)  da 


which  is  the  Fourier  transform  of  Jfc(a).  This  formula  can 
be  used  to  verify  membership  in  class  L  and  further  char¬ 
acterize  its  subclasses.  To  belong  to  class  L,  the  deriva¬ 
tive  of  the  logarithm  of  a  candidate  distribution’s  char¬ 
acteristic  function  (multiplied  by  —j)  must  have  a  Fourier 
transform  having  the  properties  of  k(a).  In  the  accom¬ 
panying  table,  we  list  the  Livy  measure  functions  k(a) 
corresponding  to  some  of  the  known  symmetric  non- 
Gaussian  distributions. 

Consider,  for  example,  the  first-order  Laplacian  auto¬ 
regressive  (LAR(l))  model  [17],  If  {X.}  has  a  Laplacian 
density  with  zero  mean  and  unit  variance,  px(x)  =  exp 
{-•Jl\x\} /'Jl  and  ix(u)  “2/(2  +  u1).  Substituting 
this  characteristic  function  into  (7),  we  find  that  the  result 
is  indeed  a  valid  characteristic  function  with 


**(«)  =  P2  +  (1  -  P2) 


2 

2  +  u5’ 


(e) 

Fig.  2.  Sample  functions  of  (a)  HAR(l).  (b)  Gaussian,  and  (c)  LAR(l) 
processes  with  a  correlation  of  0.8  between  adjacent  samples. 


Substituting  its  characteristic  function  into  (7)  and  eval¬ 
uating  the  inverse  Fourier  transform  [8],  [23],  we  obtain 
the  marginal  density  of  the  input  {H^}: 


Thus,  the  input  W„  is  zero  with  a  probability  pJ  and  a 
normalized  Laplacian  with  probability  1  -  p1.  In  other 
words,  the  generating  system  in  (6)  could  be  written  for 
the  LAR(l)  process  as 

X„  +  a,W’, 

where  [H''}  is  i.i.d.  Laplacian  (zero  mean,  unit  variance) 
and  c.  is  an  independent  discrete  random  variable  taking 
the  values  0  ind  1  with  probabilities  pJ  and  1  -  p2,  re¬ 
spectively.  Generation  of  the  Laplacian  model  thus  re¬ 
quires  a  random  coefficient  system.  One  of  the  conse¬ 
quences  of  a  random  coefficient  generating  model  for  the 
Laplacian  case  is  the  appearance  of  exponential  "run 
downs”  (with  increasing  probability  as  p  increases)  in  the 
sample  functions.  This  effect  is  illustrated  in  Fig.  2(c)  for 
a  high  correlation  of  0.8.  This  trend  may  be  unsuitable 
for  modeling  a  given  set  of  data. 

In  contrast,  suppose  {X*}  has  a  hyperbolic  secant  am¬ 
plitude  distribution5: 

1  TX 

Psix)  “  ^  SB*  Y 

*An  iatereui&|  property  of  thu  dtttnbuttoa  a  thit.  just  u  in  the  Gsuis- 
iu  cue,  iu  darocteristic  functioe  hu  the  une  functional  form  u  the 
emplitade  distribution:  ♦,(«)  -  tech  tt. 


PwM 


cos  ap/2  cosh  rw/2 
cos  t  p  +  cosh  tw 


-oo  <  w  <  oo. 


The  system  thus  required  to  generate  first-order  Markov 
hyperbolic  secant,  HAR(l),  distributed  data  is  not  a  ran¬ 
dom  coefficient  system.  Instead,  the  input  to  the  first-or¬ 
der  AR  system  is  a  sequence  of  i.i.d.  random  variables 
having  the  distribution  given  above.  HAR(l)  data  having 
with  a  correlation  of  0.8  is  plotted  in  Fig.  2(a)  along  with 
Gaussian  and  Laplacian  AR(i)  data  of  same  correlation 
for  comparison.  Although  both  the  hyperbolic  secant  and 
Laplacian  densities  have  exponential  tails,  the  HAR(l) 
and  LAR(l)  data  differ  markedly  because  of  the  exponen¬ 
tial  rundowns  in  the  LAR(l)  case. 

Note  from  Table  I  that  only  in  the  Laplacian  case  is 
k(a)  bounded  at  the  origin  and  that  it  requires  a  random 
coefficient  system.  In  general,  we  can  use  this  bounded¬ 
ness  criterion  to  determine  which  class  L  distributions  will 
necessitate  a  random  coefficient  system.  For  the  density 
function  of  [R',}  not  to  have  an  impulse  at  the  origin 
(which  results  in  a  random  coefficient  system) 


4>j(pu) 


0 


lim  tjf(u)  -  ♦xO*) 


—  00. 


From  (13),  we  must  then  have 


lim 


-  e*~) - da 


—  o». 


«—  m 


or 
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TABLE  I 

LSvy  Measure  Functions  of  Some  Symmetric  Non-Oausjian  Distrirutions 


Characteristic  Function 

Livy  Measure  Function 

Distribution 

♦*(«) 

*(«) 

Ltpiadan 

1 

1  +  a’ 

e*w  sifn  (a) 

Hyperbolic  secant 

tech  a 

cotech  (a) 

Hyperbolic  secant  squared 

—  cosech  f— ) 

sifn  (a) 

exp{2|a|)  -  l 

Cauchy 

rot 

Symmetric  stable 

•*(-  '  fcW*} 

^  0T(0)siny  J 

1  a  1  sifn  (a) 

0  <  fi  <  2 

On  the  other  hand,  s:nce  lim.-,  k(a)/a  da  ~ 
lima_0  k(a),  for  the  input  distribution  not  to  have  an  im¬ 
pulse  at  the  origin,  the  Livy  measure  must  be  unbounded 
at  the  origin.  All  k(a)  which  are  bounded  at  the  origin  will 
necessarily  demand  random  coefficient  systems  for  the 
generation  of  data  having  the  corresponding  amplitude 
distribution. 

Because  the  input  distribution  for  the  HAR(l)  model  is 
absolutely  continuous,  the  transitional  density  of  this  first- 
order  Markov  process  can  be  derived.  Using  (6) 

Px.ix.-.tyl-*)  =Pr(y  “  A*) 

=  cos  rp/2  cosh  x(y  -  px)/2 
cos  vp  +  cosh  x(y  -  px) 


Let  us  now  detail  how  HAR(l)  data  can  be  generated. 
First  the  i.i.d.  input  sequence  {IF,}  is  generated  from  the 
independent  sequence  {f/,}  distributed  uniformly  be¬ 
tween  0  and  1 ,  using  the  pointwise  transformation  W,  = 
Pi>'(U,)  where  P*1  (•)  is  the  inverse  of  the  distribution 
function  of  {IF,}.  The  correlated  data  sequence  {AT,}  is 
then  obtained  by  passing  { IF,}  through  the  linear  system 
defined  by  (6).  This  procedure  is  precisely  the  inversion 
of  conditional  distribution  function  described  in  the  pre¬ 
vious  section,  a  general  procedure  now  simplified  by  the 
assumption  of  the  linear  model.  The  distribution  function 
of  Wt  is  found  to  be 


P*(») 


(15) 


where  Px(x)  =  (2/ t)  tan'1  (exp  {tjt/2}).  Using  this  dis¬ 
tribution  function  in  (IS)  and  evaluating  the  inverse,  we 
obtain 


.  2  . 

Pw'  (u)  «=  -  sinh  1 

T 

If  we  remove  the  restriction  demanding  a  model  for  all 
p,  output  distributions  not  in  class  L  are  possible.  In  some 
cases  (as  when  the  characteristic  function  $y(u)  is  non¬ 


monotonic  [1 1]),  there  is  a  critical  value  pe  of  the  system 
coefficient  beyond  which  the  ratio  in  (7)  exceeds  unity,  a 
situation  incompatible  with  the  ratio  being  a  characteristic 
function.  Consequently,  highly  correlated  sequences  can¬ 
not  be  generated  having  such  amplitude  distributions  while 
they  can  be  generated  for  smaller  correlations.  If  the  char¬ 
acteristic  function  has  zeros,  the  range  of  p  is  further  re¬ 
stricted.  Supposing  that  uq  is  a  zero  of  <!>*(•),  the  denom¬ 
inator  of  (7)  becomes  zero  when  u  -  u^/p;  for  the  ratio 
to  be  bounded,  u,  =  u0/p  must  also  be  a  zero  of  ♦*(•). 
This  condition  then  becomes  recursive  since  a  zero  is  re¬ 
quired  at  u2  -  uo/p2.  “j  =  «o /p\  etc.  Thus  ♦*(•)  must 
have  an  infinite  number  of  zeros  if  it  has  any.  For  the 
uniform  AR(1)  example,  the  characteristic  fonction  is 
sin  u/u  and  has  an  infinite  number  of  equally  spaced  ze¬ 
ros.  First-order  Markov  uniformly  attributed  time  series 
are  thus  defined  only  forp  =  \/ktk-  ±2,  ±3,  •  •  •  . 
Systems  with  such  discrete  sets  of  coefficients  seem  to  be 
of  academic  interest  only. 

Verification  of  compatibility  of  a  time  series’  amplitude 
distribution  witn  the  conditions  implicit  in  (7),  positive 
definiteness  of  the  ratio  $>x(u)/$x(pu),  represents  a  for¬ 
midable  task  if  only  an  analytic  approach  is  used.  Success 
is  limited  by  one’s  ability  to  derive  the  inverse  Fourier 
transform  of  this  ratio  and  show  that  the  result  is  non¬ 
negative  fof  some  range  of  p.  We  used  numerical  methods 
to  check  for  the  existence  of  first-order  Markovian  non- 
Gaussian  time  series  other  than  those  in  class  L  and  vari¬ 
ants  of  the  uniform  example  given  above.  Our  procedure 
can  be  used  whenever  a  symmetric  histogram  estimate  of 
the  probability  density  of  {X,}  is  available:  an  analytic 
specification  is  not  necessary.  The  test  consists  of  the  fol¬ 
lowing  steps. 

1)  Given  a  sampled  probability  density  function,  re¬ 
sample  it  at  a  lower  (rational)  rate.  Any  of  several  deci¬ 
mation/interpolation  strategies  can  be  used  here  [4]. 

2)  Fourier  transforms  of  the  original  and  downsampled 
density  are  computed  with  care  taken  that  the  sum  of  each 
density  sequence  is  unity. 

3)  The  point-by-point  ratio  of  these  transforms  is  com- 
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puted  and  windowed  to  eliminate  inaccurate  division 
where  either  of  the  transforms  is  small.  This  window  must 
be  chosen  so  that  negative  ripples  are  not  introduced  in 
the  amplitude  domain.  Consequently,  positive  definite 
windows  like  the  triangular  would  suffice.  We  have  found 
that  a  nondefinite  window  such  as  the  rectangular  one  can 
be  used  by  noting  how  negative  its  ripples  become  and 
numerically  checking  that  no  ripple  exceeds  that  value. 

4)  The  inverse  Fourier  transform  of  the  windowed  ra¬ 
tio  is  computed  and  checked  for  “essential”  positivity: 
negative  portions  are  allowed  to  exist  but  should  be  within 
numeric  inaccuracies. 

To  illustrate  the  procedure,  we  choose  the  sampled  ver¬ 
sion  of  the  weighted  sum  of  three  equal  variance  Gauss- 
ians  with  means  -1.2,  0,  and  1.2,  respectively.  The  re¬ 
sulting  density  is  multimodal  and  hence  is  not  in  class  L. 
We  investigated  whether  this  density  could  be  generated 
by  first-order  systems  with  coefficients  1  /5  and  1  /3.  See 
Fig.  3.  We  reduced  the  sampling  rate  of  the  density  vector 
by  factors  of  5  and  3  by  simple  downsampling,  taking 
care  that  aliasing  was  minimal  by  computing  the  Fourier 
transform.  The  point-by-point  ratio  of  the  two  transforms 
contained  numeric  noise  in  the  high  frequency  region  due 
to  rounding.  We  used  a  rectangular  window  to  remove 
this  noise  and  obtained  inverse  transforms  shown  in  the 
fourth  row  of  Fig.  3.  Clearly,  the  example  density  seems 
compatible  with  p  =  1/5  but  not  with  p  =  1/3  as  the 
latter  results  contain  significant  negative  values  in  the 
tails.  This  threshold  is  close  to  the  critical  value  pc  men¬ 
tioned  previously. 

While  this  numeric  approach  is  imprecise,  it  can  be  val¬ 
idated  via  simulation.  Assuming  a  candidate  distribution 
seems  viable,  the  result  of  the  numeric  test  is  the  ampli¬ 
tude  distribution  of  the  input.  By  simple  calculation  of  the 
partial  sums,  the  cumulative  distribution  of  the  input  can 
be  calculated  and  used  to  generate  the  i.i.d.  sequence 
{ W,}  predicted  by  the  computations.  By  passing  this  se¬ 
quence  through  a  first-order  filter,  estimating  the  ampli¬ 
tude  distribution  of  the  output,  and  comparing  the  esti¬ 
mate  with  the  candidate  distribution,  the  prediction  can 
be  confirmed.  We  performed  this  test  on  the  trimodal  ex¬ 
ample  just  described  forp  =  1/5.  The  resulting  estimate 
of  the  output  distribution  did  greatly  resemble  the  candi¬ 
date  distribution  and  verified  that  amplitude  distributions 
produced  by  first-order  systems  need  not  be  unimodal.  We 
have  thus  demonstrated  the  existence  of  such  densities 
more  directly  than  in  [11]. 


B.  Nonlinear  Markov  Processes  and  Diagonal 
Expansions  of  Bivariate  Distributions 
Bivariate  distributions  of  stationary  random  processes 
have  in  the  past  been  analyzed  using  series  expansion 
methods.  These  expansions  find  application  in  the  study 
of  Markov  processes  [33]  and  of  the  effect  of  nonlineari¬ 
ties  on  random  processes,  e.g.,  in  the  analysis  of  the  out¬ 
put  of  a  cascade  of  a  narrow-band  filter  and  a  square  law 
detector  [1]. 
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Fig.  3.  Figure*  in  left  tnd  right  columns  refer  to  system  coefficient*  of  1  /J 
and  1  /3.  respectively.  The  first  row  depicts  sampled  output  amplitude  dis¬ 
tribution*  and  down-sampled  versions.  The  second  row  shows  the  corre¬ 
sponding  characteristic  functions  obtained  by  taking  Fourier  transforms. 
Point-by-point  ratio  of  these  transforms  is  shown  next  along  with  the  rect¬ 
angular  window  used.  Finally,  the  computed  input  amplitude  distribution 
(inverse  transform)  is  shown  along  with  the  output  distribution. 


Let  Px.r(x,  y)  be  the  joint  density  of  random  variables 
X  and  Y  with  marginal  densities  px(x)  and  p-( y),  respec¬ 
tively.  Suppose  Px,y(x.  y)  satisfies  the  condition 


Px.r(x>  y ) 
PrtoPr(y) 


dr  dy  <  oo. 


(16) 


Then,  complete  orthonormal  sets  {d>/r)}"0  and 
{4'J(y)}j‘-o  can  be  defined  in  L\px  dx)  and  L\pr  dy)  such 
that  the  series  expansion 


Px.r(x<  y )  "  PrMiPr(y)  1 


\4,{x)t,{y)  | 


(17) 


commonly  referred  to  as  the  diagonal  expansion  because 
of  the  single  summation,  converges  in  mean-square  sense 
[16].  A  well-known  example  is  Mehler's  expansion  of  a 
bivariate  Gaussian  density  in  terms  of  Hermite  polyno- 
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mials  [1].  Note  that  by  definition  $<>(•)  «  ^0(-)  *  1  and 
that  Ao  **  1.  The  coefficients  of  expansion  are  given  by 

-  J  j  Mx)My)pxj(*.  y)  dxdy 

and  are  by  convention  taken  to  be  nonnegative:  the  sign 
is  incorporated  in  the  orthonormal  sets  [30].  The  ordering 
of  the  basis  functions  is  determined  so  that  the  coefficients 
represent  a  decreasing  sequence:  0  s  \  S  1  and  \  & 
\j,  i  £  j.  Furthermore,  note  that  the  orthonormal  sets  are 
complementary  eigenfunctions  of  the  bivariate  density: 

J  Px.r(x.  y)<t>,<x)  dx  =  XaMj) 
j  px,y(x>  y)Uy)  dy  =  AM*)- 

The  terms  inside  the  summation  sign  of  (i'7)  account 
for  the  dependency  between  the  two  random  variables  X 
and  Y;  if  A,  =*  0  for  i  2:  l,  X and  Tare  statistically  in¬ 
dependent.  The  coefficient  Xt  is  referred  to  as  the  maximal 
correlation  coefficient  because  it  is  the  supremum  over  all 
(finite-variance)  functions  gi(-)  and  g2(-),  of  the  normal¬ 
ized  correlation  between  g,(X)  and  gi(Y)  [28].  If  <£,(*) 
and  ^i(-)  are  affine,  it  follows  from  orthonormality  that 
$i(*)  =  (*  “  fiX)/a x  and  ^,(y)  =  (y  -  nr)/or  where  p 
and  o1  are  the  mean  and  variance,  respectively.  In  this 
case,  the  maximal  correlation  coefficient  Xt  coincides  with 
the  usual  correlation  coefficient.  However,  Xt  is  in  gen¬ 
eral  larger  than  the  correlation  coefficient  in  magnitude 
and  gives  a  better  characterization  of  the  dependence  be¬ 
tween  X  and  Y:  the  random  variables  are  independent  if 
and  only  if  X,  is  zero  [26].  The  maximal  correlation  coef¬ 
ficient  and  the  functions  ^j(-)  and  ^t(-)  are  important  in 
approximating  the  series  expansion  in  (17)  with  a  finite 
number  of  terms. 

Let  us  apply  the  foregoing  discussion  to  a  stationary 
Markov  random  sequence  {X,} .  Denote  its  marginal  den¬ 
sity  by  px(-)  and  the  joint  density  of  X,  and  X._.  by 
Px.,  x.— ‘)-  We  must  have 

|  Px*x.-.(*.  y)dy  a  Pxto 

and  j  Px^x.-.(*.  >)  dr  =  px{y)  (18) 

for  all  m  because  of  stationarity.  These  conditions,  how¬ 
ever,  do  not  restrict  the  bivariate  density  functions  of  the 
process  to  be  symmetric;  Pr,.r.-«(*.  y)  is  not  necessarily 
identical  to Px..x..m{y,  x).  Although  asymmetric  diagonal 
expansions  of  the  form  given  by  equation  (17)  have  been 
studied  before  [2],  [26] ,  [31],  they  have  not  been  applied 
to  random  processes.  A  special  case  of  (17)  is  commonly 
considered  where  the  two  sets  of  orthonormal  functions 
are  identical,  which  yields  a  symmetric  bivariate  density 
and  imposes  temporal  symmetry  on  the  underlying  time 
series.  As  we  have  noted  earlier,  in  contrast  to  the  Gauss¬ 
ian  case,  the  bivariate  densities  of  non-Gaussian  pro¬ 
cesses  need  not  be  symmetric  because  of  temporal  asym¬ 


metry.  For  example,  for  the  HAR(l)  model 


Px.,x.-.(*.  y) 


cos  xpm/2  cosh  T(y  -  p"x)/2 
cos  x pm  +  cosh  r{y  -  pmx) 

(19) 

The  necessity  of  the  general  expansion  (17)  for  non- 
Gaussian  processes  is  thus  clear. 

Using  the  diagonal  expansion  of  p&.x«  - 1  (*.  y)>  we  can 
write  the  conditional  distribution  function  of  the  Markov 
process  {X,}  as 


*>x.ix.-i(*lx)  =  j^Pxto  i  AMzMOOj 


dz. 


(20) 


The  diagonal  expansion  thus  serves  as  a  tool  for  analyzing 
the  generating  system  of  the  Markov  process.  However, 
as  we  have  seen  in  Section  n,  we  heed  the  inverse  of  the 
above  conditional  distribution  function  to  generate  the 
Markov  sequence  from  an  i.i.d.  uniform  sequence.  Cal¬ 
culating  the  inverse  of  the  summation  in  (20)  even  in  an 
approximate  form  is  extremely  difficult.  If,  however,  we 
limit  the  diagonal  expansion  to  a  finite  number  of  terms 
(making  sure  that  the  integrand  is  nonnegative)  such  that 
the  conditional  distribution  function  can  be  inverted,  we 
have  a  method  for  generating  correlated  non-Gaussian 
Maikov  sequences  that  are  not  necessarily  linear.  Since 
the  additional  dependency  between  X„  and  X„_ ,  with  each 
added  term  decreases  progressively  (X,  >  X,+  1),  these 
terms  could  be  selected  to  match  the  required  dependency 
to  a  large  extent.  Sarmanov.[27]  studied  the  finite  sum, 
continuous-time  version  of  (20).  For  continuous-time 
processes,  diagonal  expansions  with  finite  number  of 
terms  cannot  be  used  when  the  eigenfunctions  and  the 
marginal  density  function  are  continuous:  the  finite  sum 
does  not  remain  nonnegative  over  the  entire  domain  for 
all  values  of  separation  t  between  the  samples.  Fortu¬ 
nately,  this  problem  does  not  arise  in  the  discrete-time 
case. 

If  the  domain  of  the  marginal  density  is  finite,  uniform 
on  [0,  I]  for  example,  polynomials  can  be  used  in  the 
finite  sums.  If  the  marginal  density  function  is  one  of  the 
classical  weight  functions,  orthogonal  polynomials  such 
as  Jacobi  polynomials  can  be  used;  otherwise  the  mo¬ 
ments  of  the  distribution  can  be  used  to  construct  orthog¬ 
onal  polynomials.  For  example,  a  uniform  [0,  1]  distrib¬ 
uted,  temporally  symmetric  Markov  process  can  defined 
by  the  joint  amplitude  distribution 

Pv..v..,(x,  y)  »  1  4-  a(2x  -  l)(2y  -  1) 


\a\*\.  OSrjS  1.  (21) 

For  distributions  defined  on  the  infinite  domain,  the 
functions  in  the  expansion  have  to  be  chosen  depending 
on  the  particular  case.  However,  some  recipes  applicable 
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to  all  situations  exist.  Take  for  instance,  the  temporally 
symmetric  process  defined  by  the  Morgenstem’s  family 
[5,  p.  578]  of  joint  densities 

Pr..Y.-M>  y )  ■  Pr(.*)Pr(y){l  +  a(lPY(x)  -  1) 

•  (2 P,(y)  -  1)}.  (22) 

It  can  be  obtained  by  passing  the  uniform  Markov  se¬ 
quence  {P,}  defined  in  (21)  through  the  nonlinearity 
Py'(-)  where  Py(-)  is  the  (cumulative)  distribution  func¬ 
tion  of  the  required  output  {K,}.  Such  an  operation  on 
{V„}  leads  to  Py,y,.x{x,  y)  -  Pv.y,.,(PY(x),  PY(y)}  from 
which  (22)  follows  by  differentiation.  Devroye  (5,  p.  580] 
gives  a  simple  generation  rule  for  {K,}  (and  thus  for 
{y*}).  A  more  direct  generation  via  (4)  requires  the  eval¬ 
uation  of  the  conditional  distribution.  This  procedure  re¬ 
sults  in  a  quadratic  equation  for  Py(YJ  involving  K,_| 
and  U„  (recall  that  this  is  i.i.d.  uniform  [0,  1]),  the  in¬ 
version  of  which  gives  us  the  generation  formula 


For  an  arbitrary  stationary  Markov  process,  the  condi¬ 
tional  means  may  or  may  not  be  linear.  These  quantities 
are  given  in  terms  of  the  components  of  the  diagonal  ex¬ 
pansion  as 

• 

£PU*.-|]  *  s  hElXMXJWX'-d 

m 

£[*■-. |xj  *  S  \E[X„-lUXm^)WXa). 

If  any  basis  function  {^/(*)}  is  linear,  under  the  condition 
that  the  fuuctions 

r  ^  px.,x.- it*,  o  r  ^  px.,x..fi'  y)  M 

J.  dx  px(x) '  ’  dy  px(y) 

do  not  change  sign  on  their  domain  (a,  b),  we  can  con¬ 
clude  that.  ,ij  other  linear  term  is  present  and  that  the  lin 


T, 


,  /3 afl(y..,)  ~  1  +  V[1  ~  3^<2Cy,-r)]2  +  12 aU^Y^) 
1  \  ficCfy,.,) 


QHYn-x)  *  C 

(20;-.)  =  o 


where  we  have  set  Q(Ym- {)  *  2Py(Yt.i)  -  1  for  sim¬ 
plicity. 

As  an  example  of  a  temporally  asymmetric  case,  con¬ 
sider  {Z,}  defined  by 

y)  =  Pzi*)Pz{y)  ( t  +  a(iPz(x) 

-  2Pz(y))(lPz(y)  -  1)}.  (23) 

Proceeding  as  above  for  the  generation  of  {Z»}.  we  find 
a  cubic  equation  in  Pz(Z.)  which  makes  the  generation 
difficult.  It  is  much  simpler  to  generate  fire  process  back¬ 
ward*.  In  this  case,  we  obtain  a  quadratic  equation  for 
1)  as  in  the  symmetric  case  above,  with  the  coef¬ 
ficients  being  different  functions  of  Z,  and  Um. 

A  major  drawback  in  using  the  diagonal  expansion 
method  for  generating  correlated  sequences  is  that  the  en¬ 
tire  range  of  correlation  coefficients  cannot  be  realized  (for 
the  processes  {K„}  and  {T,}  above,  |p|  s  1/3  and  for 
{Z»},  |p|  £  V2/3V5).  Maximally  correlated  random 
variables  are  important  in  variance  reduction  techniques 
in  Monte  Carlo  simulation.  Typically,  adding  more  terms 
in  uie  finite  sum  improves  the  available  range  of  correla¬ 
tions,  but  it  becomes  increasingly  difficult  to  ensure  the 
nonnegativity  of  the  sum.  The  question  of  the  available 
range  of  correlation  can  be  linked  to  the  comprehensive¬ 
ness  of  the  defining  bivariate  distributions  [5].  A  family 
of  bivariate  distributions  is  said  to  be  comprehensive  if  it 
includes  the  product  of  marginals  and  Frechet’s  extremal 
distributions  (which  result  in  extremum  positive  and  neg¬ 
ative  correlations  1  and  -1).  Clearly,  the  first  require¬ 
ment  is  satisfied  for  the  joint  densities  defined  using  di¬ 
agonal  expansion  methods  while  the  second  is  usually  not. 


ear  term  must  be  the  first  member  <£t(-)  [21].  The  series 
expansion  for  the  forward  conditional  mean  then  truncates 
with  the  result 

£PU*.-(]  »  \E[X.MXJ]UX'-d 

indicating  that  the  forward  conditional  mean  is  propor¬ 
tional  to  the  eigenfunction  ,).  Similarly,  when  one 
of  the  members  of  {&<•)}  is  linear,  the  expansion  for  the 
backward  conditional  mean  reduces  to  a  single  term.  In 
these  cases,  the  conditional  means  yield  direct  informa¬ 
tion  about  the  components  of  the  diagonal  expansion, 
thereby  leading  to  the  conditional  distribution  and  the 
generation  system. 

We  tested  the  analysis  procedures  described  here  as  well 
.  as  the  more  common,  Gaussian  based  ones  on  two  sets  of 
data:  the  linear  HAR(l)  time  series  {Xm}  and  the  nonlin¬ 
ear  time  series  { F,}  defined  by  (22)  also  having  a  hyper¬ 
bolic  secant  marginal  distribution.  The  nonlinear  model 
is  thus  defined  by 

1  tj  1  ,  ry  ( , 

Pr*r.-,&,  y)  “  j  sech  y  •  -  sech  —  jl  +  3a 

•  -  tan.'1  (eTl/I)  -  1 

Lx 

•  ^  tan'1  (e*7^)  -  1  j. 

The  correlation  coefficient  of  the  adjacent  samples  of 
this  process  is  pr  —  3 b2a  «■  0.88375  a  where  b  * 
/*) tan'1  e*1^)  (1/2)  sech  rz/2dz  =  14/r5  f(3) 
•and  ft ‘ )  ‘S  Riemann's  zeta  function.  Also,  from  (2)  we 
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Fig.  4.  Empirical  power  tpecua  for  temporally  symmetric  and  asymmetric  time  series  having  a  hyperbolic  secant  marginal 

amplitude  distribution. 


Fig.  5.  The  left  column  displays  empirical  forward  and  backward  conditional  means  for  a  temporally  symmetric  hyperbolic 
secant  time  series  and  the  right  column  the  eoodittooal  means  far  the  linear,  temporally  asymmetric  HAR(l)  time  series 
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and  it  follows  that  the  correlation  function  of  this  tem¬ 
porally  symmetric  model  is  Rr(m)  =  (1  -  3 b2)6(m)  + 
36:a".  Note  th3t  this  temporally  symmetric  first-order 
Markov  time  series  can  only  be  produced  by  a  nonlinear 
system.  The  correlation  function  of  the  corresponding  lin¬ 
ear  time  series,  while  having  the  same  marginal  distribu¬ 
tion.  is  simply  Rx  (*0  =  pm. 

We  generated  these  two  time  series  so  that  they  have 
the  same  coireiation  coefficient  of  0.25  between  adjacent 
samples.  The  power  spectral  density  estimates  of  the  two 
data  plotted  in  Fig.  4  arc  quite  similar.  The  forward  con¬ 
ditional  mean  of  the  HAR(l)  data  is  linear  while  the  back¬ 
ward  mean  is  not,  thus  confirming  its  temporal  asymme¬ 


try.  For  the  nonlinear,  temporally  symmetric  data,  both 
conditional  means  are  nonlinear  but  identical.  Sec  Fig.  5. 
The  conditional  means  thus  identify  temporal  asymmetry 
well  where  power  spectral  estimation  based  techniques 

faU. 


IV.  Conclusions 

Non-Gauss ian  processes  present  new  challenges  to  the 
statistical  signal  processor  attempting  to  develop  analysis 
techniques.  We  have  shown  that  correlation  analysis  can¬ 
not  be  expected  to  suffice,  which  immediately  distin¬ 
guishes  non-Gaussian  data  from  Gaussian.  Temporal 
symmetry  can  be  assessed  with  conditional  mean  analy¬ 
sis.  While  not  shown  here,  the  statistical  characteristics 
of  conditional  mean  estimates  are  identical  to  those  of  his¬ 
togram-based  probability  density  estimators  [25).  Con¬ 
sequently,  hypothesis  tests  for  determining  the  temporal 
symmetry  of  a  time  series  can  be  established.  However, 
first-order  conditional  mean  analysis  does  not  capture  all 
of  a  time  scries’  temporal  symmetry  properties:  similar 
forward  and  backward  means  can  belie  an  asymmetric 
process. 
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Successful  analysis  can  be  measured  by  one’s  ability  to 
generate  a  statistically  identical  version  via  simulation. 
This  yardstick  has  implicitly  formed  the  basis  for  our  sys¬ 
tem-based  modeling  of  non-Gaussian  Markov  processes. 
Linearity  of  the  system  can  be  tested  by  considering  the 
forward  and  backward  conditional  means.  If  the  forward 
conditional  mean  is  linear  and  the  backward  mear,  is  non¬ 
linear  (the  time  series  must  be  temporally  asymmetric  if 
linear  and  non-Gaussian),  then  a  linear  model  may  suf¬ 
fice.  If  a  first-order  model  is  appropriate,  the  amplitude 
distribution  of  the  driving  “white  noise"  process  can  be 
determined  with  our  numerical  method.  If  the  method  fails 
to  produce  a  valid  amplitude  distribution,  nonlinear 
models  may  be  the  only  recourse.  If  nonlinear  models 
seem  necessary,  our  theoretical  framework  is  insufficient 
to  produce  the  correct  generation  system  in  all  cases.  If, 
however,  the  maximal  correlation  coefficient  dominates 
the  diagonal  expansion  of  the  bivariate  density,  condi¬ 
tional  mean  analysis  can  yield  an  approximate  system. 
Whether  linear  or  not,  the  generating  system  examples 
shown  here  demonstrate  the  complexities  for  first-order 
sequences;  higher  order  ones  can  only  be  more  compli¬ 
cated. 

Alternative  approaches  to  non-Gaussian  analysis  are 
now  being  actively  studied.  Most  activity  is  devoted  to 
the  bispectrum  and  its  higher  variants.  This  approach 
clearly  has  uses;  for  example,  the  bispectrum  eliminates 
independent  additive  components  that  have  zero  skew. 
However,  this  approach  is  based  on  higher  order  corre¬ 
lation  functions  which  can  be  construed  as  an  ad  hoc  ex¬ 
tension  of  Gaussian-based  second-order  correlation.  The 
ability  to  extract  the  generation  system  for  a  time  series 
from  its  higher  order  correlation  functions  has  not  been 
demonstrated;  techniques  have  not  been  developed  to 
classify  the  system's  linearity.  Our  results  indicate  that 
consideration  of  both  the  directionality  and  the  amplitude 
distribution  arc  needed  for  any  scheme  to  be  considered 
capable. 

Processing  of  non-Gaussian  data  also  requires  the  mea¬ 
surement  of  new  properties  beyond  correlation  analysis. 
Many  signal  processing  algorithms  implicitly  assume 
temporal  symmetry  of  the  data  being  analyzed.  We  have 
shown  that  forward  and  backward  prediction  errors  are 
not  necessarily  equal  in  the  non-Gaussian  case  Conse¬ 
quently,  new  signal  processing  strategies  need  to  be  de¬ 
veloped  to  cope  with  such  data.  For  example,  speech  data 
have  been  subjected  to  linear  predictive  analysis  for  dec¬ 
ades  Assuming  for  the  sake  of  argument  a  stochastic 
model  for  speech  signals,  they  are  decidedly  non-Gauss- 
tan  Since  linear  speech  production  models  seem  ic  cap¬ 
ture  much  of  their  characteristics,  speech  must  be  tem¬ 
porally  asymmetric.  Linear  prediction  algorithms  that 
weight  forward  and  backward  prediction  errors  equally 
can  be  improved  by  consideration  of  the  temporal  asym¬ 
metry  properties  we  have  demonstrated  here  Since  back¬ 
ward  mean- squared  errors  are  always  smaller,  more 
weight  should  be  placed  on  them  in  such  analyses 


Care  must  be  taken  in  applying  the  results  and  concepts 
developed  here  to  non-Gaussian  data.  Many  of  our  results 
have  been  developed  for  first-order  Markov  processes. 
Extension  of  these  ideas  to  higher  order  data  in  paiticulr.r 
must  be  carefully  considered.  A  distribution-free  tech¬ 
nique  for  estimating  model  order  is  described  elsewhere 
[15].  A  logical  extension  of  the  analysis  techniques  would 
be  based  on  higher  order  cond’tional  means.  Such  quan¬ 
tities  require  much  data  to  estimate  and  the  difficulty  of 
the  analysis  increases.  In  these  situations,  higher  order 
correlation  functions  are  the  only  current  alternative; 
however,  for  the  reasons  given  above,  they  are  limited  in 
scope.  New  techniques  based  on  a  fundamental  under¬ 
standing  of  non-Gaussian  processes  are  clearly  needed. 
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